A broad class of models that imply structure on both the joint
and  marginal  distributions  of  multivariate  categorical  (ordinal  or  nominal) 
responses is introduced. These parsimonious models can be used to simultaneously
describe the marginal distributions of the responses and the
association  structure  among  the  responses.  As  a  special  case,  this  class 
of  models  includes  classical  log-  and  logit-linear  models.  In  this  sense, 
we  address  model  fitting  for  multivariate  polytomous  response  data  from 
a  very  general  perspective.  Simultaneous  models  for  joint  and  marginal 
distributions  are  useful  in  a  variety  of  applications,  including  longitudinal 
studies  and  studies  dealing  with  social  mobility  and  inter-rater  agreement. 
We  outline  a  maximum  likelihood  fitting  algorithm  that  can  be  used  for 
fitting  a  large  class  of  models  that  includes  the  class  of  simultaneous  models. 
The  algorithm  uses  Lagrange's  method  of  undetermined  multipliers  and  a 
modified  Newton-Raphson  iterative  scheme.  We  also  discuss  goodness-of-fit 
tests  and  model-based  inferences.  Inferences  for  certain  model  parameters 
are  shown  to  be  equivalent  for  product-Poisson  and  product-multinomial 
sampling assumptions. This useful equivalence result generalizes existing
results.  The  models  and  fitting  method  are  illustrated  for  several  applications. 
Missing  data  are  often  a  problem  for  multivariate  response  data.  We 
consider  inferences  about  loglinear  models  for  which  only  certain  disjoint 
sums  of  the  data  are  observable.  We  derive  an  explicit  formula  for  the 
observed  information  matrix  associated  with  the  loglinear  parameters  that 
is  intuitively  appealing  and  simple  to  evaluate.  The  observed  information 
matrix  can  be  evaluated  at  the  maximum  likelihood  estimates  and  inverted 
to  obtain  an  estimate  of  the  precision  of  the  loglinear  parameter  estimates. 
The EM algorithm can be used to fit these incomplete-data loglinear models.
We  describe  this  algorithm  in  some  detail,  paying  special  attention  to  the 
Poisson  loglinear  model  fitting  case.  Alternative  fitting  algorithms  are  also 
outlined. One proposed alternative uses both the EM and Newton-Raphson
algorithms, thereby resulting in a faster, more stable algorithm. We illustrate
the  utility  of  these  results  using  latent  class  model  fitting. 

CHAPTER 1
INTRODUCTION 


1.1     A  Brief  Introduction  to  the  Problem 

There  are  many  situations  when  multiple  responses  are  observed  for  each 
'subject'  in  a  group,  or  several  groups.  Here  'subject'  is  generically  used  to 
refer  to  a  randomly  chosen  object  that  generates  responses.  The  multiple 
responses  could  represent  repeated  measurements  taken  on  subjects  over  time 
or  occasions.  They  could  be  the  ratings  assigned  by  several  judges  that  all 
viewed and rated the same set of slides (here, the 'subjects' are the slides).
Or,  perhaps,  it  may  be  that  several  distinct  or  noncommensurate  responses 
are  recorded  for  each  subject.  These  responses  are  often  categorical — ordinal 
or  nominal — and  inevitably  interrelated.  This  dissertation  addresses  issues 
related  to  modeling  and  model  fitting  for  multivariate  categorical  (ordinal  or 
nominal)  responses. 

Models  for  multivariate  categorical  response  data  are  usually  developed 
to  answer  questions  about  (i)  the  association  structure  among  the  multiple 
responses  or  (ii)  the  behavior  of  the  marginal  distributions  of  the  response 
variables.  Specifically,  a  typical  question  of  the  first  type  is,  "How  are  the 
responses  interrelated  and  is  this  interrelationship  the  same  across  the  levels 
of  the  covariates?"  A  typical  type  ii  question  is,  "How  do  the  (marginal) 
responses  depend  on  the  covariates  or  occasions?"  Historically,  many  models 
(e.g., log- and logit-linear models) have been developed for the primary
purpose of answering the type i questions. Many of these models can easily
be  fitted  using  maximum  likelihood  (ML)  methods.  These  models  typically, 
however,  are  not  useful  for  answering  the  type  ii  questions  (Cox,  1972). 
Marginal  models — those  models  used  to  answer  type  ii  questions — are  not 
as  well  developed.  One  reason  for  this  is  that  ML  fitting  of  these  marginal 
models  is  more  difficult.  At  present,  the  method  of  weighted  least  squares 
(WLS)  is  used  almost  exclusively  for  fitting  these  models. 

Suppose  that  we  are  interested  in  answering  questions  of  both  types 
i  and  ii.  Usually  the  questions  are  addressed  using  two  different  models,  a 
joint  distribution  model  and  a  marginal  model,  and  fitting  them  separately.  It 
seems  reasonable  to  want  a  model  that  can  be  used  to  address  simultaneously 
both  questions.  That  is,  we  would  like  a  model  that  simultaneously  implies 
structure  on  both  the  joint  and  marginal  distribution  parameters.  To  date, 
there  has  been  very  little  work  done  on  the  development  and  fitting  of  these 
simultaneous  models. 

Whenever  multiple  responses  are  observed  it  is  inevitable  that  there  will 
be  missing  data.  There  are  several  ways  to  fit  the  Poisson  loglinear  model  with 
incomplete  data.  One  popular  method  is  to  use  the  EM  algorithm  to  find  the 
ML  estimates  of  the  loglinear  parameters.  One  drawback  to  this  algorithm 
is that a precision estimate of the ML estimators is not produced as a by-product.
Several numerical techniques have been developed to approximate
the  observed  information  matrix,  which,  upon  inversion,  will  act  as  the 
precision  estimate.  However,  it  would  be  of  some  convenience  to  derive  an 
explicit  formula  for  the  observed  information  matrix,  at  least  in  some  special 
cases. 

1.2     Outline of Existing Methodologies — No Missing Data

We  begin  our  discussion  by  considering  the  case  of  no  missing  data. 
There  are  many  methods  for  analyzing  multivariate  categorical  (ordinal  or 
nominal)  response  data.  These  methods  usually  involve  fitting  (separately) 
models  for  the  joint  or  the  marginal  distributions  of  the  response  vectors. 
In  rare  instances,  simultaneous  models  for  both  the  joint  and  marginal 
distributions  are  considered.  Maximum  likelihood  fitting  methods  for  the 
joint  distribution  models  are  simple  and  described  in  almost  every  standard 
text  on  categorical  data  analysis.  The  fitting  of  marginal  models  using 
ML  methods  is  more  difficult.  Maximum  likelihood  fitting  of  the  marginal 
homogeneity  model  was  considered  by  Madansky  (1963)  and  Lipsitz  (1988). 
The  fitting  of  a  more  general  class  of  marginal  models  was  considered 
by  Haber  (1985a).  Finally,  the  fitting  of  simultaneous  models  using  ML 
methods  has  only  been  addressed  in  the  bivariate  response  case.  The  fitting 
technique  becomes  very  complicated  when  there  are  more  than  two  categorical 
responses.  To  appreciate  the  complexity  of  extending  the  technique  to 
multivariate response data, see Section 6.5 of McCullagh and Nelder (1989)
or  perhaps  Dale  (1986).  In  contrast,  the  ML  fitting  method  of  Chapter  2  can 
easily  be  used  to  fit  many  marginal  and  simultaneous  models.  In  the  next  few 
paragraphs,  we  briefly  describe  the  existing  methods  for  modeling  and  model 
fitting  for  multivariate  categorical  response  data. 

Modeling Joint Distributions Separately. One common method for analyzing
multivariate categorical responses is to model the joint distribution only.
These models, which include classical log- and logit-linear models for the
joint probabilities, are useful for describing the association structure among
the  responses.  The  last  30  years  have  seen  the  development  of  these  methods 
for  analyzing  multivariate  categorical  responses  (Haberman,  1979;  Bishop  et 
al.,  1975;  Agresti,  1984,  1990).  For  specificity,  consider  the  following  panel 
study:  One  hundred  randomly  selected  subjects  were  asked  how  interested 
they  were  in  the  political  campaigns.  They  were  to  respond  on  the  3  point 
ordinal  scale,  (1)  Not  Much,  (2)  Somewhat,  and  (3)  Very  Much.  Then  four 
years  later  the  same  group  of  subjects  was  asked  to  respond  on  the  same 
scale  to  the  same  question.  A  separate  investigation  into  the  association 
structure  would  enable  us  to  answer  questions  of  a  conditional  nature.  For 
example,  we  could  estimate  the  probability  of  responding  'Very  Much'  on  the 
second  occasion  given  that  the  response  at  the  first  occasion  was  'Not  Much'. 
The  description  of  these  'transitional'  probabilities,  although  very  interesting, 
may  not  be  completely  satisfactory.  We  may  also  be  interested  in  addressing 
questions  with  regard  to  the  marginal  distributions.  Perhaps  we  would  like 
to  answer  the  question,  "Are  the  distributions  of  responses  to  the  political 
interest  question  the  same  for  each  occasion?"  Laird  (1991),  in  a  nice  review  of 
likelihood-based  methods  for  longitudinal  analysis,  mentions  that  the  utility 
of  classical  log-  and  logit-linear  models  is  restricted  to  two  situations:  (1) 
modeling  the  dependence  of  a  univariate  response  on  a  set  of  covariates  and 
(2)  modeling  the  association  structure  between  a  set  of  multivariate  responses. 
These  models  place  structure  on  the  joint  probabilities  and  so  they  are  not 
directly  useful  for  studying  the  dependence  of  the  marginal  probabilities  on 
occasion  and  other  covariates.  This  problem  was  pointed  out  by  several 
authors (Cox, 1972; Prentice, 1988; McCullagh and Nelder, 1989;
Liang et al., 1991). An advantage of these models is that they are simple to fit
using  either  WLS  (Grizzle  et  al.,  1969),  ML  (McCullagh  and  Nelder,  1989), 
or  iterative  proportional  fitting  (Bishop  et  al.,  1975)  methods.  There  are 
many  standard  statistical  programs  available  for  fitting  these  models  (SAS, 
SPSS, BMDP, GLIM, GENSTAT).
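
To make the distinction between the 'transitional' and marginal questions concrete, the following minimal sketch computes both kinds of quantities for the panel study described above. The 3 x 3 table of counts is invented purely for illustration (it is not data from any study), and the names `counts` and `transition_prob` are hypothetical.

```python
# Hypothetical counts for 100 subjects (illustration only, not real data).
# Rows: response at the first occasion; columns: response four years later.
# Categories: 0 = Not Much, 1 = Somewhat, 2 = Very Much.
counts = [
    [20, 10, 5],
    [8, 22, 10],
    [2, 8, 15],
]
n = sum(sum(row) for row in counts)


def transition_prob(table, i, j):
    """Estimate P(second response = j | first response = i)."""
    return table[i][j] / sum(table[i])


# Type i (conditional) question: 'Very Much' later given 'Not Much' first.
p_very_given_not = transition_prob(counts, 0, 2)

# Type ii (marginal) question: compare the two occasion-specific distributions.
first_margin = [sum(row) / n for row in counts]
second_margin = [sum(row[j] for row in counts) / n for j in range(3)]

print(p_very_given_not)
print(first_margin, second_margin)
```

The transition probability uses a single row of the joint table, whereas the two marginal distributions are sums across rows and columns. This is why a model that places structure only on the joint probabilities does not speak directly to the marginal question.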

Modeling Marginal Distributions Separately. A second approach to analyzing
multivariate categorical responses is to model only the marginal
distributions and to ignore the joint distribution structure. Full likelihood
methods  that  consider  only  models  for  the  marginal  probabilities  tacitly 
assume  a  saturated  model  for  the  joint  distribution.  Therefore,  the  models 
may  be  far  from  parsimonious.  In  the  non-Gaussian  response  setting,  there 
is  a  distinction  between  these  marginal  models  and  the  transitional  (or 
conditional)  models  of  the  previous  paragraph.  Marginal  models  describe  the 
occasion-specific  distributions  and  the  dependence  of  those  distributions  on 
the  covariates.  Transitional  or  conditional  models  describe  the  distribution 
of  individual  changes  over  occasions.  Models  for  these  transitions  can  be 
represented  as  probability  distributions  for  the  future  state  'given'  the  past 
states.  Questions  regarding  transition  probabilities  can  only  be  investigated 
with  longitudinal  data.  On  the  other  hand,  questions  regarding  the  marginal 
probabilities  could  theoretically  be  answered  using  cross-sectional  data, 
provided the cohort (subject) effects were negligible. Panel studies, which yield
longitudinal data, provide more powerful tests for the significance of
within-cluster factors, such as an occasion effect. This follows because there is a
reduced cohort effect; we are using the same panel of subjects at each occasion. For
further discussion about the distinction between marginal and transitional
models,  see  Ware  et  al.  (1988),  Laird  (1991),  and  Zeger  (1988). 

We  will  briefly  discuss  existing  methods  for  making  inferences  about 
the marginal probabilities separately. We will group these methods into five
categories: (1) nonmodel-based methods, (2) WLS methods, (3) ML methods,
(4) semi-parametric methods, and (5) other methods.

Nonmodel-based  methods  can  be  used  to  derive  test  statistics  used  for 
testing  specific  hypotheses  regarding  the  marginal  distributions.  Examples 
include  the  Cochran-Mantel-Haenszel  (1950,  1959)  statistic  which  can  be  used 
for  testing  the  hypothesis  of  marginal  homogeneity  (MH)  (cf.  White  et  al., 
1982),  McNemar's  (1947)  statistic  which  can  be  used  for  testing  the  equality  of 
two  dependent  proportions,  and  Madansky's  (1963)  likelihood-ratio  statistic 
for MH. Madansky's statistic measures the difference between the fit of the model of
marginal homogeneity and the fit of the unstructured (saturated) model (see also Lipsitz,
1988  and  Lipsitz  et  al.,  1990).  Many  other  relevant  test  statistics,  some  of 
which  are  generalizations  or  modifications  of  the  aforementioned  (cf.  Mantel, 
1963;  White  et  al.,  1982),  exist.  Cochran's  (1950)  Q  statistic  and  Darroch's 
(1981)  Wald-type  statistic  are  examples  of  other  test  statistics  that  can  be 
used  to  test  for  marginal  homogeneity. 
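
As a small illustration of a nonmodel-based test, the sketch below computes McNemar's statistic for two dependent proportions in its usual form, the squared difference of the two discordant counts divided by their sum, without a continuity correction. In a 2 x 2 table this is a test of marginal homogeneity. The function name and the example counts are hypothetical.

```python
def mcnemar_statistic(n12, n21):
    """McNemar's chi-squared statistic (1 df, no continuity correction).

    n12 and n21 are the two discordant (off-diagonal) counts of a
    2 x 2 table of paired binary responses.
    """
    if n12 + n21 == 0:
        raise ValueError("the statistic is undefined with no discordant pairs")
    return (n12 - n21) ** 2 / (n12 + n21)


# Hypothetical discordant counts, for illustration only.
stat = mcnemar_statistic(20, 10)
print(stat)
```

Only the discordant pairs enter the statistic; the concordant (diagonal) counts carry no information about the difference between the two marginal proportions.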

Presently, if one were to fit a marginal model, say a generalized loglinear
model of the form $C \log A\mu = X\beta$, where $\mu$ is the vector of expected counts
in the full contingency table, he or she would most likely use the WLS fitting
algorithm. Most statistical software that fits these generalized loglinear
models does so using WLS. There are some advantages to using WLS. It
is computationally simple. Second-order marginal information is all that is
needed. And, the estimates are asymptotically equivalent to ML estimates.
Some  disadvantages  are  that  covariates  must  be  categorical,  sampling  zeroes 
create  problems,  and  estimates  are  sensitive  when  second-order  marginal 
counts  are  small.  The  WLS  method  for  analyzing  categorical  data  was 
originally  outlined  by  Grizzle,  Starmer  and  Koch  (1969).  Subsequently, 
marginal models for longitudinal categorical data, or more generally multivariate
categorical response data, have been introduced and fitted using the
WLS  method  (Koch  et  al.,  1977;  Landis  and  Koch,  1979;  Landis  et  al.,  1988; 
Agresti,  1989). 
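
To show how the three matrices in a generalized loglinear model act on the expected counts, here is a minimal sketch for a 2 x 2 table. The particular choices of A (which sums cells into the four marginal totals) and C (which forms the two marginal logits) are illustrative assumptions, as are the numerical values of the expected counts and the helper names.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def generalized_loglinear(C, A, mu):
    """Return C log(A mu), with the logarithm applied componentwise."""
    return matvec(C, [math.log(t) for t in matvec(A, mu)])


# Expected counts ordered as (mu11, mu12, mu21, mu22); hypothetical values.
mu = [40.0, 10.0, 20.0, 30.0]

# A forms the marginal totals (mu1+, mu2+, mu+1, mu+2).
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

# C contrasts the log margins to give the row logit and the column logit.
C = [
    [1, -1, 0, 0],
    [0, 0, 1, -1],
]

marginal_logits = generalized_loglinear(C, A, mu)
print(marginal_logits)
```

With these choices, taking X to be a single column of ones would state that the two marginal logits are equal, which is marginal homogeneity for the 2 x 2 table. The example values do not satisfy that model, since the two logits differ.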

Maximum  likelihood  fitting  of  marginal  models  is  more  difficult  since 
the  model  utilizes  marginal  probabilities,  rather  than  joint  probabilities  to 
which  the  likelihood  refers.  When  the  responses  are  correlated,  as  they 
invariably  are,  the  marginal  counts  do  not  follow  a  product-multinomial 
distribution.  The  full-table  likelihood  must  be  maximized  subject  to  the 
constraint  that  the  marginal  probabilities  satisfy  the  model.  Haber  (1985a) 
considers fitting generalized loglinear models of the form $C \log A\mu = X\beta$ using
Lagrange  multipliers  and  an  unmodified  Newton-Raphson  iterative  scheme. 
The  algorithm  becomes  very  difficult  to  implement  for  even  moderately  large 
tables.  This  is  primarily  due  to  the  difficulty  of  inverting  the  large  Hessian 
matrix  of  the  Lagrangian  objective  function.  In  this  dissertation  we  consider  a 
modified  Newton-Raphson  that  uses  a  much  simpler  matrix  than  the  Hessian. 
The  matrix  is  easily  inverted  even  for  relatively  large  tables.  Haber  (1985b) 
considers the estimation of the parameters $\beta$ in the special case $C \log \mu = X\beta$.
We will use a modification of the method of Aitchison and Silvey (1958, 1960)
and Silvey (1959) to investigate the asymptotic behavior of the estimators of
$\beta$ in the more general model $C \log A\mu = X\beta$, thereby extending the work of
Haber (1985b). Another relevant paper, Haber and Brown (1986), considers
ML fitting of a model for the expected counts $\mu$ that has loglinear and
linear  constraints.  One  can  test  hypotheses  about  the  marginal  probabilities 
by  comparing  the  fit  of  relevant  models.  Haber  (1985a,  1985b)  and  Haber 
and  Brown  (1986)  only  consider  fitting  the  marginal  models  separately.  No 
attempt  has  been  made  to  simultaneously  model  the  joint  and  marginal 
distributions. 

Semi-parametric  methods  such  as  quasi-likelihood  (Wedderburn,  1974) 
and  a  multivariate  extension,  generalized  estimating  equations  (GEE),  have 
become  popular  in  recent  years.  The  work  of  Liang  and  Zeger  (1986),  which 
advocated  the  use  of  these  GEEs,  has  been  extended  to  cover  the  multivariate 
categorical  response  data  setting  (Prentice,  1988;  Zhao  and  Prentice,  1991; 
Stram  et  al.,  1988;  Liang  et  al.,  1991).  With  these  semi-parametric  methods, 
the  likelihood  is  not  completely  specified.  Instead,  generalized  estimating 
equations  are  chosen  so  that,  when  the  marginal  model  holds,  even  if  the 
association  among  the  multiple  responses  is  misspecified,  the  estimators  are 
consistent  and  asymptotically  normally  distributed.  These  estimators,  used 
in  conjunction  with  a  robust  estimator  of  their  covariance  (Liang  and  Zeger, 
1986;  Zeger  and  Liang,  1986;  White,  1980,  1981,  1982;  Royall,  1986),  result 
in  consistent  inference  about  the  effects  of  interest.  When  the  responses  are 
truly  independent,  the  estimating  equations  with  correlation  matrix  taken  to 
be  the  identity  matrix,  are  equivalent  to  the  likelihood  equations.  The  GEE 
approach  requires  the  specification  of  a  'working'  association  or  correlation 
matrix. Examples of working associations include those that imply all
pairwise associations (measured in terms of odds ratios) are the same and
that the higher order associations are negligible (Liang et al., 1991).

A related approach is known as GEE2. The consistency of these estimators
follows only if both the marginal model and the pairwise association
model are correctly specified. This approach is a second order extension
of the GEEs of Liang and Zeger (1986) which are now termed GEE1. It
is second order because the estimation of the marginal model parameters
and the pairwise association model parameters is considered simultaneously.
The focus of both approaches, GEE1 and GEE2, is usually on modeling
the  marginal  distributions — investigating  how  the  marginal  distributions 
depend  on  occasion  and  covariates.  The  association  is  considered  a  nuisance. 
Presently,  there  are  no  tests  for  goodness-of-fit  of  these  models  and  so  the 
investigation  into  how  well  both  models  fit  can  be  done  only  at  an  empirical 
level.  The  assumption  that  higher  order  effects  are  negligible  may  not  be 
tenable.  Testing  procedures  to  assess  the  validity  of  these  assumptions  have 
yet  to  be  developed.  Also,  in  contrast  to  WLS  and  ML  methods,  which 
require only that the missing data be 'missing at random' (MAR), the
semi-parametric approaches require the missing data to be 'missing completely
at  random'  (MCAR).  The  assumption  that  the  missing  data  mechanism  is 
MCAR  is  a  much  stronger  assumption  than  MAR  (Little  and  Rubin,  1986). 

Finally,  there  are  many  other  approaches  to  analyzing  the  marginal 
probability  structure  separately.  There  are  random  effects  models,  whereby 
subject-specific  random  effects  induce  a  correlation  structure  on  the  multiple 
responses.  The  marginal  approach — the  full  likelihood  is  obtained  by 
averaging  across  the  random  effects — is  computationally  difficult  (Stiratelli 
et al., 1984). An alternative is to condition on the sufficient statistics
for  the  subject  effects  and  consider  finding  the  estimates  by  maximizing 
the  conditional  likelihood.  For  further  details  on  these  conditional  and 
unconditional methods see Rasch, 1961; Tjur, 1982; Agresti, 1991; Stiratelli
et  al.,  1984;  Conaway,  1989,  1990.  As  yet  another  alternative,  Koch  et  al. 
(1980)  give  a  bibliography  for  relevant  nonparametric  methods  for  analyzing 
repeated  measures  data.  Agresti  and  Pendergast  (1986)  consider  replacing 
the  actual  observations  by  their  within  cluster  rank  and  testing  for  marginal 
homogeneity  using  the  ordinary  ANOVA  statistic  for  repeated  measures  data. 
A  three-stage  estimator  for  repeated  measures  studies  with  possibly  missing 
binary  responses  has  been  developed  by  Lipsitz  et  al.  (1992).  This  approach 
is  very  similar  to  a  generalized  least  squares  approach,  but  it  has  some  of 
the  nice  features  of  the  GEE  approaches.  One  of  these  nice  features  is  that 
the  estimators  and  their  variance  estimates  are  consistent  under  very  mild 
assumptions.  An  extension  of  this  method  to  the  polytomous  response  case 
has  yet  to  be  developed. 

Simultaneous  Investigation  of  Joint  and  Marginal  Distributions.  There 
has been very little work done to investigate simultaneously the joint and
marginal  distribution  structure.  In  some  ways  GEE2  is  an  attempt  to 
describe  both  distributions.  However,  only  the  pairwise  (not  the  joint) 
association  structure  is  modeled;  the  higher-order  associations  are  considered 
a  nuisance.  Tests  comparing  nested  models  have  not  been  developed  in  this 
semi-parametric setting. Full likelihood approaches have been addressed
by  Dale  (1986),  McCullagh  and  Nelder  (1989,  Chapt.  6),  and  Becker  and 
Balagtas (1991). Dale models the joint distributions of bivariate ordered
categorical responses by assuming that the log global odds ratios follow a
linear  model.  The  marginal  probabilities  are  assumed  to  follow  a  cumulative 
logit  model.  McCullagh  and  Nelder  consider  simultaneously  modeling  the 
joint  and  marginal  probabilities  of  a  bivariate  dichotomous  response  (two 
distinct  responses)  by  assuming  that  the  log  odds-ratios  follow  a  linear 
model  and  that  the  marginal  probabilities  follow  a  logit-linear  model.  Their 
example  included  age  as  a  categorical  covariate.  Finally,  Becker  and  Balagtas 
consider  models  for  two-period  cross-over  data.  The  bivariate  dichotomous 
response  was  the  response  to  the  two  different  treatments.  Order  of  treatment 
application  was  considered  a  covariate.  They  assumed  that  the  two  log  odds 
ratios  followed  a  linear  model  and  that  the  marginal  probabilities  satisfied  a 
loglinear  model.  Because  it  is  the  marginal  probabilities  and  not  the  joint 
probabilities  that  satisfy  a  loglinear  model,  Becker  and  Balagtas  refer  to  the 
model  as  log  nonlinear. 

The  ML  model  fitting  approach  used  by  each  of  these  authors  involves 
a  reparameterization  of  the  likelihood,  which  is  a  function  of  the  joint 
probabilities,  in  terms  of  the  joint  and  marginal  model  parameters.  The 
reparameterization  in  the  bivariate  response  case — the  case  each  author 
considered — is  somewhat  complicated  especially  for  multi-level  responses.  To 
make  matters  worse,  the  extension  of  this  method  to  general  multivariate 
polytomous responses looks to be extremely difficult. If the reparameterizations
are made so that the full likelihood is expressible in terms of the
joint  and  marginal  model  parameters,  the  likelihood  can  be  maximized  using 
a  Newton-Raphson-type  algorithm.  Basically,  one  must  solve  for  the  root  of 
some  nonlinear  score  equation.  This  maximization  approach  is  very  sensitive 
to the starting value in that convergence to a local maximum is not likely
unless  the  starting  estimate  is  very  close  to  the  actual  maximum.  Finding 
reasonable  starting  values  is  not  a  simple  task.  Dale  (1986)  outlines  a  method, 
specifically  for  the  models  considered  in  that  paper,  for  finding  a  starting 
estimate. 

In  this  dissertation,  we  outline  an  ML  fitting  method  that  can  easily  be 
used  to  fit  a  large  class  of  simultaneous  models,  including  those  considered 
by  Dale,  McCullagh  and  Nelder,  and  Becker  and  Balagtas.  The  approach 
involves  using  Lagrange's  method  of  undetermined  multipliers  along  with  a 
modified Newton-Raphson iterative scheme. For all of the models considered,
an  initial  estimate  for  the  algorithm  is  the  data  counts  themselves  along  with 
a  vector  of  zeroes  corresponding  to  a  first  guess  at  the  values  of  the  Lagrange 
multipliers.  The  convergence  of  the  algorithm  is  quite  stable.  The  extension 
to  multivariate  polytomous  response  data  is  straightforward. 
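
The general idea of combining Lagrange multipliers with Newton-Raphson iterations, started at the observed counts and a zero multiplier, can be sketched on a toy problem. The code below is only an illustration under stated assumptions: it uses the plain (unmodified) Newton-Raphson step with the full Hessian of the Lagrangian, not the modified scheme developed in Chapter 2; it handles a single linear constraint on the expected counts; and it assumes all counts are positive. The function names and the example counts are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_constrained(y, h, n_iter=25):
    """Maximize sum(y*log(mu) - mu) subject to sum(h*mu) = 0.

    Newton-Raphson is applied to the stationary equations of the
    Lagrangian sum(y*log(mu) - mu) + lam * sum(h*mu).
    """
    k = len(y)
    mu = [float(v) for v in y]  # initial estimate: the data counts
    lam = 0.0                   # initial Lagrange multiplier: zero
    for _ in range(n_iter):
        grad = [y[i] / mu[i] - 1.0 + lam * h[i] for i in range(k)]
        grad.append(sum(h[i] * mu[i] for i in range(k)))
        hess = [[0.0] * (k + 1) for _ in range(k + 1)]
        for i in range(k):
            hess[i][i] = -y[i] / mu[i] ** 2
            hess[i][k] = h[i]
            hess[k][i] = h[i]
        step = solve(hess, [-g for g in grad])
        mu = [mu[i] + step[i] for i in range(k)]
        lam += step[k]
    return mu, lam


# Hypothetical 2 x 2 counts ordered (y11, y12, y21, y22); the constraint
# mu12 - mu21 = 0 is marginal homogeneity for a 2 x 2 table.
mu_hat, lam_hat = fit_constrained([40, 20, 10, 30], [0, 1, -1, 0])
print(mu_hat, lam_hat)
```

For this toy constraint the answer is known in closed form (the two off-diagonal fitted counts both equal the average of the observed off-diagonal counts), which gives a simple check on the iterations. The bordered matrix solved at each step grows with the number of cells plus the number of constraints, which suggests why the full Hessian becomes burdensome for large tables.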

1.3     Outline  of  Existing  Methodologies — Missing  Data 

Missing  data  is  often  an  issue  when  the  response  is  multivariate  in  nature. 
Missing  data  can  also  occur  in  more  hypothetical  situations.  Examples 
include  loglinear  latent  class  models  (Goodman,  1974;  Haberman,  1988) 
and  linear  mixed  or  random  effects  models  (Laird  et  al.,  1987).  In  latent 
class  analyses,  a  latent  variable,  which  is  unobservable,  is  assumed  to  exist. 
Mixed  or  random  effects  models  posit  the  existence  of  some  unobservable 
random  variables  that  affect  the  mean  response.  In  this  brief  outline,  we  will 
consider  ML  methods  for  model  fitting  when  the  data  are  not  completely 
observable. Little and Rubin (1986) provide a nice summary of methods
for model fitting with incomplete data. There are many ways to find the
maximum  likelihood  estimators  when  the  data  are  not  completely  observable, 
each  method  having  its  positive  and  negative  features.  We  could  work  directly 
with  the  incomplete-data  likelihood,  which  is  usually  complicated  relative  to 
the  complete-data  likelihood,  and  use  a  Newton-Raphson  or  Fisher-scoring 
algorithm.  Palmgren  and  Ekholm  (1987)  and  Haberman  (1988)  use  these 
methods  to  obtain  maximum  likelihood  estimates  and  their  standard  errors. 
Alternatively,  we  could  avoid  the  complicated  likelihood  altogether  and  use 
the  Expectation-Maximization  algorithm  (Dempster  et  al.,  1977).  Sundberg 
(1976)  discusses  the  properties  of  the  EM  algorithm  when  it  is  used  to 
fit  models  to  data  coming  from  the  regular  exponential  family.  The  EM 
algorithm  is  one  of  the  more  flexible  ML  fitting  algorithms  for  missing  data 
situations. We will primarily focus on this method for fitting loglinear models
with  incomplete  data. 
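
A minimal sketch of the EM idea for a loglinear model with incomplete data follows. It assumes a hypothetical 2 x 2 table in which the column classification was not recorded for subjects in the second row, so that only the sum of the two second-row cells is observed, and it fits the independence model. The E-step splits the observed sum in proportion to the current fitted means; the M-step refits independence to the completed table, which has a closed form. The function name and the counts are invented for illustration.

```python
def em_independence(y11, y12, z_row2, n_iter=200):
    """EM fit of the independence model to a 2 x 2 table when only the
    sum z_row2 = y21 + y22 of the second-row cells is observed."""
    n = y11 + y12 + z_row2
    # Starting values: split the unobserved row evenly.
    mu = [[float(y11), float(y12)], [z_row2 / 2.0, z_row2 / 2.0]]
    for _ in range(n_iter):
        # E-step: allocate the observed sum in proportion to fitted means.
        row2_total = mu[1][0] + mu[1][1]
        y21 = z_row2 * mu[1][0] / row2_total
        y22 = z_row2 * mu[1][1] / row2_total
        filled = [[y11, y12], [y21, y22]]
        # M-step: independence fit, mu_ij = (row total)(column total)/n.
        rows = [sum(r) for r in filled]
        cols = [filled[0][j] + filled[1][j] for j in range(2)]
        mu = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
    return mu


# Hypothetical data: y11 = 30, y12 = 20, and y21 + y22 = 50.
mu_fit = em_independence(30, 20, 50)
print(mu_fit)
```

In this example the iterations settle on a column split of 0.6 to 0.4 in the second row, matching the split observed in the first row, as independence requires. The sketch returns fitted means only; it produces no standard errors.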

Although  the  EM  algorithm  is  easily  implemented  to  fit  loglinear  models 
with  incomplete  data,  the  algorithm  does  not  provide  an  estimate  of  precision 
of  the  model  parameter  estimators.  Meng  and  Rubin  (1991)  outline  a 
supplemental  EM  (SEM)  algorithm,  whereby,  upon  convergence  of  the  EM 
algorithm,  the  variance  matrix  for  the  model  estimators  is  adjusted  to  account 
for  missing  data.  The  adjustment  is  a  function  of  the  rate  of  convergence  of 
the  EM  algorithm,  which  in  turn  is  a  function  of  how  much  information 
is  missing.  Meng  and  Rubin  numerically  estimate  the  rate  of  convergence, 
thereby  obtaining  an  estimate  of  precision  that  reflects  missingness.  Although 
this  approach  should  prove  to  be  applicable  in  the  general  situation,  it  still 
is  desirable  to  derive  an  explicit  formula  for  the  variance  matrix  that  reflects 
missingness. Other authors (Meilijson, 1989; Louis, 1982) have discussed
methods  for  estimating  precision  of  model  estimators  when  the  data  are 
incomplete  and  the  EM  algorithm  is  used.  Meilijson's  method  involves  EM- 
aided  differentiation,  which  is  essentially  a  numerical  differentiation  of  the 
score  vector.  The  method  relies  on  the  assumption  that  the  observed  data 
components are i.i.d. (independent and identically distributed). Louis
gives  an  analytic  formula  for  the  observed  information  matrix  based  on  the 
incomplete  data.  The  computation  of  the  observed  information  matrix  based 
on  this  formula  is  not  straightforward  and  must  be  considered  separately  for 
each  special  application. 

1.4     Format  of  Dissertation 

In  Chapter  2,  we  develop  a  maximum  likelihood  method  for  fitting  a  large 
class  of  models  for  multivariate  categorical  response  data.  This  development 
follows  a  general  discussion  about  parametric  modeling.  Concepts  such  as 
degrees  of  freedom  and  model  distances  (or  goodness  of  fit)  are  described  at 
an  intuitive  level.  We  also  describe  and  compare  the  asymptotic  distributions 
of  freedom  parameter  estimators  under  product-multinomial  and  product- 
Poisson sampling assumptions. Chapter 3 has more of an applied flavor.
We  consider  simultaneously  modeling  the  joint  and  marginal  distributions 
of  multivariate  categorical  response  vectors.  A  broad  class  of  simultaneous 
models  is  introduced.  The  models  can  be  fitted  using  the  techniques  of 
Chapter  2.  Several  numerical  examples  are  considered.  Chapter  4  outlines  the 
ML  fitting  technique  known  as  the  EM  algorithm.  This  algorithm  is  used  to 
fit  models  with  incomplete  data.  Some  advantages  and  disadvantages  of  using 
the EM algorithm are addressed. The most important disadvantage is that
the  algorithm  does  not  provide,  as  a  by-product,  a  precision  estimate  of  the 
ML  estimators.  We  derive  an  explicit  formula  for  the  observed  information 
matrix  for  the  Poisson  loglinear  model  parameters  when  only  disjoint  sums  of 
the  complete  data  are  observable.  An  application  to  latent  class  modeling  is 
considered.  We  also  propose  an  ML  fitting  algorithm  that  uses  both  EM  and 
Newton-Raphson  steps.  The  modified  algorithm  should  prove  to  have  many 
positive  features. 

In this dissertation, we do not distinguish typographically between
scalars, vectors, and matrices. Parameters and variables are treated as
objects, their dimensions either being explicitly stated or implied contextually.
By convention, functions that map scalars into scalars, when applied to
vectors, will be defined componentwise. For example, if $\mu$ represents an
$n \times 1$ vector, then

$$\log \mu = (\log \mu_1, \log \mu_2, \ldots, \log \mu_n)'.$$

We frequently use abbreviations that are common in the statistical
literature. They include ML (Maximum Likelihood), WLS (Weighted
Least Squares), IWLS (Iterative (Re)Weighted Least Squares), and EM
(Expectation-Maximization).

The range (or column) space of an $n \times p$ matrix $X$ is denoted by $\mathcal{M}(X)$
and is defined as $\{\mu : \mu = X\beta,\ \beta \in R^p\}$. The symbols $\otimes$ and $\oplus$ are the
binary operators 'direct product' and 'direct sum'. The direct (or Kronecker)
product is taken to be the right-hand product. That is,

$$A \otimes B = \{A b_{ij}\}.$$

The direct sum, $C$, of two matrices $A$ and $B$ is defined as

$$C = A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$


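The symbol $D(\mu)$ represents a diagonal matrix with the elements of $\mu$ on the
diagonal. That is,

$$D(\mu) = \begin{pmatrix} \mu_1 & 0 & \cdots & 0 \\ 0 & \mu_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_n \end{pmatrix}.$$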
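
The matrix conventions above can be checked numerically. The following is a minimal sketch in Python with NumPy, which is not a tool used in this dissertation; the helper names `direct_product` and `direct_sum` and the small matrices are hypothetical. Note that NumPy's `np.kron(A, B)` builds blocks of the form $a_{ij}B$, so the right-hand product $\{A b_{ij}\}$ corresponds to `np.kron(B, A)`.

```python
import numpy as np

def direct_product(A, B):
    """Right-hand Kronecker product {A * b_ij}: block (i, j) is A scaled by B[i, j]."""
    return np.kron(B, A)

def direct_sum(A, B):
    """Block-diagonal matrix with A in the upper-left and B in the lower-right corner."""
    C = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    C[:A.shape[0], :A.shape[1]] = A
    C[A.shape[0]:, A.shape[1]:] = B
    return C

A = np.array([[1.0, 2.0]])
B = np.array([[1.0, 0.0],
              [0.0, 3.0]])

P = direct_product(A, B)    # blocks: [A*1, A*0; A*0, A*3]
S = direct_sum(A, B)        # 3 x 4 block-diagonal matrix

mu = np.array([2.0, 5.0, 7.0])
D = np.diag(mu)             # D(mu): elements of mu on the diagonal
log_mu = np.log(mu)         # scalar functions act componentwise
```

With these inputs, `P` has rows `[1, 2, 0, 0]` and `[0, 0, 3, 6]`, which makes the right-hand convention visible: each entry of `B` scales a full copy of `A`.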

In Chapter 4, we make use of the bracket notation often used by
statistical and mathematical programming languages (e.g., Splus, Matlab).
To illustrate the notation, consider a matrix $A$. The (sub)matrix $A[,-2]$ is
then the matrix $A$ with the second column deleted. Similarly, the matrix $A[-3,]$
is the matrix $A$ with the third row deleted.
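
For readers more familiar with zero-based array libraries, the same deletions can be sketched in Python with NumPy (an illustration only; the matrix `A` is hypothetical). The bracket notation of the text counts rows and columns from one, whereas `np.delete` counts from zero.

```python
import numpy as np

A = np.arange(1, 13).reshape(3, 4)   # 3 x 4 matrix with entries 1..12

# A[,-2] in the text's (1-based) notation: drop the second column.
A_drop_col2 = np.delete(A, 1, axis=1)

# A[-3,] in the text's notation: drop the third row.
A_drop_row3 = np.delete(A, 2, axis=0)
```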

Equation  numbering  is  consecutive  within  sections  of  a  chapter,  the 
first  number  representing  the  chapter  in  which  it  appears.  For  example,  the 
thirteenth  equation  in  section  2.3  is  equation  (2.3.13).  Within  each  appendix, 
the  equations  are  numbered  consecutively.  For  example,  the  third  equation 
in  Appendix  B  is  numbered  (B.3).  Tables  are  numbered  consecutively  within 
chapters so that, for instance, Table 3.2 represents the second table within
Chapter  3.  Theorems,  lemmas,  and  corollaries  are  numbered  independently 
of  each  other.  All  are  numbered  consecutively  within  sections.  Therefore, 
Corollary  3.2.2  is  the  second  corollary  within  section  3.2  and  Theorem  2.3.1 
is  the  first  theorem  within  section  2.3. 


CHAPTER  2 
RESTRICTED  MAXIMUM  LIKELIHOOD  FOR  A  GENERAL 
CLASS  OF  MODELS  FOR  POLYTOMOUS  RESPONSE  DATA 


2.1     Introduction 

In  this  chapter,  we  consider  using  maximum  likelihood  methods  to  fit  a 
general  class  of  parametric  models  for  univariate  or  multivariate  polytomous 
response  data.  The  models  will  be  specified  in  terms  of  freedom  equations 
and/or  constraint  equations.  These  two  ways  of  specifying  models  will  be 
discussed  at  length  in  section  2.2.  The  model  specification  equations  may  be 
linear or nonlinear in the model parameters. Specifically, if $\mu$ represents the
$s \times 1$ vector of expected cell means, the linear constraints will be of the form
$L\mu = d$ and the nonlinear constraints will be of the form $U'C\log(A\mu) = 0$.
The freedom equations will have the form $C\log(A\mu) = X\beta$, where the
components of the vector $\beta$ are referred to as the freedom parameters. In
Chapter  3  of  this  dissertation,  we  discuss  more  specifically  models  that  can 
be  specified  in  terms  of  these  constraint  and  freedom  equations.  The  models 
of  that  chapter  allow  one  to  simultaneously  model  the  joint  and  marginal 
distributions  of  multivariate  polytomous  response  vectors. 

The  maximum  likelihood,  model  fitting  algorithm  of  this  chapter  utilizes 
Lagrange  multipliers  and  a  modified  Newton-Raphson  iterative  scheme.  In 
particular,  the  models  will  be  specified  in  terms  of  constraint  equations  and 
the log likelihood will be maximized subject to the constraint equations being
satisfied. One common optimization algorithm found in the mathematics
literature is Lagrange's method of undetermined multipliers. We show that
Lagrange's method is easily implemented for ML fitting of the models under
consideration in this chapter. One problem with Lagrange's method of
undetermined multipliers for ML fitting of statistical models has been that it
becomes computationally infeasible for large data sets. By using a modified
Newton-Raphson method, which involves inverting a matrix of a simpler form
than the more complicated Hessian, we consider fitting models to relatively
large data sets.
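
To make the Lagrange-multiplier idea concrete, here is a minimal sketch in Python with NumPy. It is not the modified Newton-Raphson scheme developed in this chapter: it applies plain (unmodified) Newton-Raphson to the stationarity equations of the Lagrangian for a Poisson log likelihood subject to one linear constraint $L\mu = d$. The counts `y`, the constraint matrix `L` (which forces the two off-diagonal means of a 2 x 2 table to be equal), and the function name `fit_lagrange_newton` are all hypothetical. The iteration is started at the cell counts with zero multipliers, and it has no safeguards such as step-halving.

```python
import numpy as np

def fit_lagrange_newton(y, L, d, n_iter=25):
    """Plain Newton-Raphson on the Lagrangian of a Poisson log likelihood.

    Maximizes sum(y*log(mu) - mu) subject to L @ mu = d by solving
        y/mu - 1 + L' lam = 0,   L mu - d = 0
    for (mu, lam). Illustrative only: no positivity or step-size safeguards.
    """
    y = np.asarray(y, dtype=float)
    mu = y.copy()                      # start at the observed counts
    lam = np.zeros(L.shape[0])         # start at zero multipliers
    s, u = len(mu), L.shape[0]
    for _ in range(n_iter):
        score = y / mu - 1.0 + L.T @ lam
        resid = L @ mu - d
        # Jacobian of the stacked equations with respect to (mu, lam).
        J = np.zeros((s + u, s + u))
        J[:s, :s] = -np.diag(y / mu**2)
        J[:s, s:] = L.T
        J[s:, :s] = L
        step = np.linalg.solve(J, -np.concatenate([score, resid]))
        mu = mu + step[:s]
        lam = lam + step[s:]
    return mu, lam

# Hypothetical 2 x 2 table stored as (y11, y12, y21, y22).
y = np.array([10.0, 6.0, 2.0, 8.0])
L = np.array([[0.0, 1.0, -1.0, 0.0]])   # constraint: mu12 - mu21 = 0
d = np.array([0.0])
mu_hat, lam_hat = fit_lagrange_newton(y, L, d)
```

For this constraint the restricted estimates have a closed form, with the two off-diagonal means both equal to $(y_{12} + y_{21})/2 = 4$, which gives a simple check on the iteration. The matrix solved at each step here is the full Jacobian of the Lagrangian equations; the appeal of a modified scheme is precisely that it avoids working with a matrix of this complexity.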

We  also  explore  the  asymptotic  behavior  of  the  estimators  within  the 
framework  of  constraint — rather  than  freedom — models.  Usually,  asymptotic 
properties  of  model  and  freedom  parameter  estimators  are  studied  within  the 
framework  of  freedom  models.  Aitchison  and  Silvey  (1958,  1960)  and  Silvey 
(1959)  studied  the  asymptotic  behavior  of  the  model  parameter  estimators 
when  the  model  is  specified  in  terms  of  constraint  equations.  Following  the 
arguments  of  Aitchison  and  Silvey,  we  derive  the  asymptotic  distributions  of 
both  the  model  and  freedom  parameter  estimators. 

Previous work by Haber (1985a) addressed maximum likelihood methods
for fitting models of the form

$$C\log(A\mu) = X\beta$$

to categorical response data. Subsequently, Haber and Brown (1986)
discussed ML fitting for loglinear models that were also subject to the
linear constraints $L\mu = d$, where these constraints necessarily include the
identifiability constraint required of $\mu$, the vector of product-multinomial
cell means. Both of these papers advocated the use of Lagrange's method
of undetermined multipliers to find the maximum likelihood estimates of
the model parameters $\mu$. The method of Haber (1985a) involved using
the (unmodified) Newton-Raphson method, which becomes computationally
unattractive as the number of components in $\mu$ gets moderately large. Both
Haber (1985a) and Haber and Brown (1986) were primarily concerned with
measuring model goodness of fit and therefore did not consider estimation
of freedom parameters. Haber (1985b) did consider estimation of freedom
parameters, but only when the simpler model $C\log\mu = X\beta$ was used. One of
the several ways that we extend the work of Haber (1985a, 1985b) and Haber
and Brown (1986) is to consider estimation of the freedom parameters when
the more general model $C\log(A\mu) = X\beta$ is used.

Others have considered ML fitting of nonstandard models for multivariate
polytomous response data. Laird (1991) outlines the different approaches
taken by different authors. As an example, Dale (1986) considered ML fitting
for a particular class of models for bivariate polytomous ordered response data
which were of the form

$$C_1\log(A_1\mu) = X_1\beta_1, \qquad g(A_2\mu) = X_2\beta_2.$$

Specifically, the first freedom equation specifies a loglinear model for the
association between the two responses measured by the global cross-ratios
(cross-product ratios of quadrant probabilities) so that $C_1$ and $A_1$ are of
a particular form. The second set of freedom equations specifies some
generalized linear model (McCullagh and Nelder, 1989) for the marginal
means or probabilities. Maximum likelihood estimators for the association
model freedom parameters $\beta_1$ and the marginal model freedom parameters
$\beta_2$ were simultaneously computed by iteratively solving the score equations
via a quasi-Newton approach. To use this maximization technique, the score
functions, which involve the cell probabilities, must be written explicitly
as a function of the freedom parameter $\beta = \mathrm{vec}(\beta_1, \beta_2)$. A nontrivial
approach to finding reasonable starting values for $\beta$ is discussed by Dale
(1986). Along with Dale, McCullagh and Nelder (1989, section 6.5) and
Becker and Balagtas (1991) consider writing the score as an explicit function
of the freedom parameters so that the marginal and association freedom
parameter estimates may be computed simultaneously. In general, when there
are more than two responses, this is not a simple task, and so an extension
of this method to multivariate polytomous response data models becomes very
cumbersome. Also, convergence of the iterative scheme requires good initial
estimates of the freedom parameter $\beta$, which may be very difficult to find. In
contrast, the maximization approach of this chapter, which is similar to that of
Haber (1985a) and Haber and Brown (1986), is shown to be easily implemented for
fitting multivariate polytomous response data models. With this technique,
it is not necessary to write the cell means as an explicit function of the
freedom parameters. Further, initial estimates of the freedom parameters,
which are difficult to find, are not needed for this technique. Instead, only
initial estimates of the cell means and undetermined multipliers are needed.
Reasonable initial estimates of the cell means are the cell counts themselves,
while a reasonable initial estimate of the vector of undetermined multipliers
is the zero vector, the value of the undetermined multipliers when the model
fits the data perfectly.

We will now introduce the class of models that we will consider for the
remainder of this chapter and the next, more applied chapter. The models
have the form

$$C_1\log(A_1\mu) = X_1\beta_1, \qquad C_2\log(A_2\mu) = X_2\beta_2, \qquad L\mu = d,$$

where the linear constraints include the identifiability constraints. Later,
when we study the asymptotic behavior of the ML estimators, we will
require the components of $d$ to be zero unless they correspond to an
identifiability constraint. These models, which are of the form $C\log(A\mu) = X\beta$,
$L\mu = d$, will allow us to model both the joint and marginal distributions
simultaneously when dealing with multivariate response data. The bivariate
association model of Dale (1986) is a special case of these models, as we
can specify the matrices $C_1$ and $A_1$ so that $C_1\log(A_1\mu)$ is the vector of log
bivariate global cross-ratios. Restricting the marginal models to have the form
$C_2\log(A_2\mu) = X_2\beta_2$, rather than allowing the marginal means to follow a
generalized linear model, as Dale (1986) did, is not overly restrictive. In
fact, many of the generalized linear models for multinomial cell means can be
written in this form. For example, loglinear, multiple logit, and cumulative
logit models are of this form. Also, unlike Haber (1985a) and Haber and
Brown (1986), we will be concerned with estimation of the freedom parameter
$\beta = \mathrm{vec}(\beta_1, \beta_2)$, thereby allowing for model-based inference.

Model-based  inferences  usually  refer  to  inferences  based  on  freedom 
parameters.  With  freedom  equations,  we  have  the  luxury  of  choosing  a 
parameterization  that  results  in  the  freedom  parameters  having  meaningful 
interpretations. For instance, a freedom parameter $\beta$ may be chosen to
represent a departure from independence in the form of a log odds ratio.
More generally, we usually will try to parameterize in such a way that
certain parameters will measure the magnitude of an effect of interest.

For example, consider an opinion poll where a group of subjects were
asked on two different occasions whether they would vote for the President
again in the next election. Suppose they were asked immediately after the
President took office and again after the President had served for two years.
The researcher may be interested in determining whether the distribution of
response changed from Time 1 to Time 2 and, if so, in assessing the magnitude of
the change. The data configuration can be displayed as in Table 2.1.

Table 2.1. Opinion Poll Data Configuration

                         Data                        Probabilities
                        Time 2                          Time 2
                     yes        no              yes          no
  Time 1   yes    $y_{11}$   $y_{12}$       $\pi_{11}$   $\pi_{12}$   $\pi_{1+}$
           no     $y_{21}$   $y_{22}$       $\pi_{21}$   $\pi_{22}$   $\pi_{2+}$
                                            $\pi_{+1}$   $\pi_{+2}$

We could formulate a model of the form $C\log(A\mu) = X\beta$ in such a way
that the freedom parameter $\beta$ has a nice interpretation with respect to the
hypothesis of interest. One such model is

$$\log\left(\frac{\phi_j(1)}{\phi_j(2)}\right) = \alpha + \rho_j, \qquad j = 1, 2, \qquad (2.1.1)$$

where the parameter $\phi_j(i)$ is a marginal probability, i.e.

$$\phi_j(i) = \begin{cases} \pi_{i+}, & j = 1 \\ \pi_{+i}, & j = 2 \end{cases}$$

and, for identifiability of the freedom parameters,

$$\rho_1 = -\rho_2 = \rho.$$

Model (2.1.1) is a simple logit model for the marginal probabilities $\{\pi_{i+}\}$ and
$\{\pi_{+j}\}$. The parameter $\rho$ measures the magnitude of departure from marginal
homogeneity in that $\rho = 0$ if and only if there is marginal homogeneity.

One could use the Wald statistic $\hat\rho/\mathrm{se}(\hat\rho)$ to test the hypothesis. If the
null hypothesis is rejected, we can assess the magnitude of departure from
marginal homogeneity by computing a confidence interval for $2\rho$, which is the
log odds ratio comparing the odds that a randomly chosen subject responds
'yes' at Time 2 to the odds that a randomly chosen subject responds 'yes' at
Time 1.
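
As a rough numerical illustration (a sketch with made-up counts, not data from this dissertation), the parameters of a two-occasion marginal logit model can be computed directly from the sample margins. The sketch assumes the parameterization in which the log odds of 'yes' is $\alpha + \rho$ at Time 1 and $\alpha - \rho$ at Time 2, and multinomial sampling with no further restriction on the joint distribution, in which case the ML estimates are obtained by plugging in the sample proportions. The array `y` and all variable names are hypothetical, and no standard error is computed.

```python
import numpy as np

# Hypothetical counts: rows = Time 1 (yes, no), columns = Time 2 (yes, no).
y = np.array([[40.0, 10.0],
              [20.0, 30.0]])
p = y / y.sum()                      # sample joint proportions

def logit(prob):
    """Log odds of a probability."""
    return np.log(prob / (1.0 - prob))

logit_time1 = logit(p[0, :].sum())   # row margin: proportion 'yes' at Time 1
logit_time2 = logit(p[:, 0].sum())   # column margin: proportion 'yes' at Time 2

# Solve alpha + rho = logit_time1 and alpha - rho = logit_time2.
alpha_hat = 0.5 * (logit_time1 + logit_time2)
rho_hat = 0.5 * (logit_time1 - logit_time2)
```

Under this assumed parameterization, `2 * rho_hat` is the difference between the two sample marginal log odds, and it is zero exactly when the two sample margins agree.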

This  simple  example  illustrates  the  utility  of  using  freedom  parameters 
and  the  corresponding  model-based  inferences.  For  this  reason,  this  chapter 
will be concerned with making inferences about both the model parameters
$\mu$ and the freedom parameters $\beta$.

The  contents  of  the  following  sections  are  as  follows.  In  section  2.2, 
we provide an overview of parametric modeling. The two ways of specifying
models, via constraint equations and via freedom equations, are discussed
at length in section 2.2.1. It is shown that a model specified in terms of
freedom equations can be respecified in terms of constraint equations. In
particular, the freedom equation $C\log(A\mu) = X\beta$, which actually constrains
the function $C\log(A\mu)$ to lie in some manifold spanned by the columns of $X$,
is equivalent to the constraint equation $U'C\log(A\mu) = 0$, where the columns
of $U$ form a basis for the null space of $X'$. Other topics covered in section 2.2
include interpretation and calculation of 'degrees of freedom' and measuring
model goodness of fit.

We  describe  a  general  class  of  models  for  univariate  or  multivariate 
polytomous  response  data  in  section  2.3.1.  The  data  vector  y  is  initially 
assumed  to  be  a  realization  of  a  product-multinomial  random  vector.  We 
describe  the  asymptotic  behavior  of  the  product-multinomial  ML  estimators 
in  section  2.3.3.  Lagrange's  method  of  undetermined  multipliers  is  used  to 
find  restricted  maximum  likelihood  estimates  of  the  model  parameters  and 
the  freedom  parameters.  The  actual  algorithm  is  described  in  detail  in  section 
2.3.4. 

In  section  2.4,  we  explore  the  relationship  between  the  product-multinomial 
and  product-Poisson  ML  estimators.  General  results  that  allow  one  to 
ascertain when inferences based on product-Poisson estimates are the same as
inferences based on product-multinomial estimates are shown to follow quite
directly when one works within the framework of constraint models. Theorem
2.4.2 of this section represents a generalization of the results of Birch (1963)
and Palmgren (1981).

2.2     Parametric  Modeling — An  Overview 

Inferences about the distribution of some $n \times 1$ random vector $Y$ are
often based solely on a particular realization $y$ of $Y$. In parametric modeling
it is often the case that the distribution of $Y$ is known up to an $s \times 1$ vector
of model parameters $\theta$; i.e. it is 'known' that

$$Y \sim F(y; \theta), \qquad \theta \in \Theta, \qquad (2.2.1)$$

where $\Theta$ is some $(s - q)$-dimensional ($q \geq 0$) subset of $R^s$ known to contain the
true unknown parameter $\theta^*$. The cumulative distribution function $F$ maps
points in $R^n$ into the unit interval $[0, 1]$ and is assumed to be known.

In general, we will allow the dimension $s$ of $\theta$ to grow with $n$. For
example, let $Y = (Y_1, \ldots, Y_n)$ have independent components such that

$$Y_i \sim \text{ind } G(y_i; z_i(\theta)), \qquad i = 1, \ldots, n,$$

where $z_i(\theta)$ is some function of $\theta$ associated with the $i$th component of $Y$.
The function $z_i$ could be defined as $z_i(\theta) = \theta_i$, in which case $s = n$. Or, on
the other hand, $z_i$ could be a mapping from $R^s$ to $R^1$ with $s$ fixed.

2.2.1     Model  Specification. 

In parametric settings, models for the data, or more precisely, models for
the distribution of $Y$, can be completely specified by recording the family of
candidate distributions that $F$ may belong to. That is, one must specify the
form for $F(\cdot; \theta)$ and the space $\Theta_M$ that is assumed to contain the true value
$\theta^*$ of $\theta$. In parametric modeling, the form of $F(\cdot; \theta)$ is assumed known, but
the true value $\theta^*$ is not. Denote a parametric model by $[F(\cdot; \theta); \theta \in \Theta_M]$ or
more simply by $[\Theta_M]$. We say the model $[\Theta_M]$ 'holds' if the true parameter
value $\theta^*$ is a member of $\Theta_M$, i.e.

$$[\Theta_M] \text{ holds} \iff \theta^* \in \Theta_M.$$

A model does not hold if $\theta^* \notin \Theta_M$.

The objective of model fitting is to find a simple, parsimonious model
that holds (or nearly holds). By parsimonious, we mean that the vector $\theta$ can
be obtained as a function of relatively few unknown parameters. An example
of a parsimonious model for the distribution of an $n$-variate normal vector
with unknown mean vector $\mu$ and known covariance is $[\Theta_\beta]$, where

$$\Theta_\beta = \{\mu \in R^n : \mu_j = \beta,\ j = 1, \ldots, n,\ \beta \text{ unknown}\}.$$

Notice that all $n$ components of $\mu$ can be obtained as a function of
one unknown parameter $\beta$. Thus, all of our estimation efforts can be
directed towards the estimation of the common mean $\beta$. An example of a
nonparsimonious model is the so-called saturated model $[\Theta]$, where

$$\Theta = \{\mu : \mu \in R^n\} = R^n.$$

In this case, $\mu$ is a function of $n$ unknown parameters.

The  question  of  whether  or  not  the  parsimonious  model  holds  is  an 
entirely  different  matter.  Practically  speaking,  a  model  will  rarely  strictly 
hold. Therefore, we will often say a model holds if it nearly holds, i.e. for
some small $\epsilon$

$$\inf_{\theta \in \Theta_M} \|\theta - \theta^*\| < \epsilon.$$

Without delving too much into the philosophy of model fitting and the
simplicity principle (Foster and Martin, 1966), we point out that for a model
to be practically useful it must be robust to the 'white noise' of the process
generating $Y$. That is, it should account for only the obvious systematic
variation. A model would be said to be robust to the white noise variability
if the model parameter estimates based on different realizations of $Y$ are very
similar. As an example, if instead of $[\Theta_\beta]$, the saturated model $[\Theta]$ was used
to draw inferences about the normal mean vector $\mu$, we would find that the
model fit perfectly, but that upon repeated sampling the model estimates
would change dramatically. Thus, the model is not robust to the white noise
of the process. On the other hand, the parsimonious model $[\Theta_\beta]$ estimates
would change very little from sample to sample, varying with the sample
mean of $n$ observations. This model is robust to the white noise variability.
Therefore, if the model would hold, or nearly hold, we would say it was a
good model.
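
The contrast between the two models can be seen in a small simulation. The following is a sketch in Python with NumPy under assumed values (common mean 3, unit variance, $n = 50$, 2000 replications, fixed seed), none of which come from the text. It compares how much the estimate of the first mean component varies across realizations under the saturated model, where the estimate is the observation itself, and under the common-mean model, where it is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 50, 2000
beta_true, sigma = 3.0, 1.0

# Each row is one realization of Y with common mean beta_true.
Y = rng.normal(beta_true, sigma, size=(n_rep, n))

# Saturated model: the estimate of mu is y itself (perfect fit),
# so the estimate of the first component is just the first observation.
sat_est_first = Y[:, 0]

# Parsimonious model: every component of mu is estimated by the sample mean.
pars_est_first = Y.mean(axis=1)

var_saturated = sat_est_first.var(ddof=1)
var_parsimonious = pars_est_first.var(ddof=1)
```

In theory the first variance is $\sigma^2$ and the second is $\sigma^2/n$, so the parsimonious estimates should vary far less from sample to sample, provided the common-mean model holds.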

Freedom Models. In the previous $n$-variate normal example we specified a
model $[\Theta_\beta]$ in terms of some unknown parameter $\beta$. Aitchison and Silvey
(1958, 1960) and Silvey (1959) refer to the parameter $\beta$ as a 'freedom
parameter' and the model $[\Theta_\beta]$ as a 'freedom model'. These labels are
reasonable since we can measure the amount of freedom we have for estimating
$\theta$ by noting the number of independent freedom parameters there are in the
model. The model $[\Theta_\beta]$ has one degree of freedom for estimating the mean
vector $\mu$. Thus, once an estimate of the single parameter $\beta$ is obtained the
entire vector $\mu$ can be estimated; it is a function of the one parameter $\beta$.
Notice that 'degrees' of freedom correspond to integer dimension in that a
degree of freedom is gained (lost) if we introduce (omit) one independent
freedom parameter, thereby increasing (decreasing) the dimensionality of $\Theta_\beta$
by one.

In general we will denote a freedom model by $[\Theta_X]$, where

$$\Theta_X = \{\theta \in \Theta : g(\theta) = X\beta,\ \beta \in R^p\}.$$

The function $g$ is some differentiable vector-valued function mapping $\theta \in \Theta$
into $r$-dimensional Euclidean space $R^r$. The 'model' matrix $X$ is an $r \times p$ full
column rank matrix of known numbers. To calculate degrees of freedom for
$[\Theta_X]$ we will initially assume $g$ satisfies

$$\forall\, \theta_0 \in \Theta_X, \quad \left(\frac{\partial g(\theta)}{\partial \theta'}\right)\bigg|_{\theta_0} \text{ is of full row rank } r.$$

It also will be assumed that the constraints implied by $g(\theta) = X\beta$ are
independent of the $q$ constraints implied by the model $[\Theta]$ of (2.2.1). Well
defined models will satisfy these conditions. For example, any $g$ that is
invertible satisfies the derivative condition. Actually this derivative condition
is not a necessary condition for the model to be well defined. Later, we will
show that $g$ need only satisfy a milder derivative condition.

The degrees of freedom for the model $[\Theta_X]$ can be obtained by subtracting
the number of constraints implied by $[\Theta_X]$ from the total number of model
parameters, $s$. The number of constraints implied by $[\Theta_X]$ is $(r - p) + q$, the
dimension of the null space of $X'$ plus the $q$ constraints implied by model $[\Theta]$.
Hence, the model degrees of freedom for $[\Theta_X]$ is

$$df[\Theta_X] = s - (r - p + q). \qquad (2.2.2)$$

In view of (2.2.2), the model degrees of freedom, an integer measure of the freedom
one has for estimating $\theta$, is an increasing function of $p$, the number of freedom
parameters. In fact, for the special case when $q = 0$ and $g(\theta) = \theta$ (so $s = r$),
we have that the number of degrees of freedom for model $[\Theta_X]$ is simply $p$,
the number of freedom parameters. This gives us another good reason for
calling $\beta$ a freedom parameter and $[\Theta_X]$ a freedom model.
Constraint Models. Notice that

$$\{\theta \in \Theta : g(\theta) = X\beta,\ \beta \in R^p\} \qquad (2.2.3)$$

can be rewritten as

$$\{\theta \in \Theta : U'g(\theta) = 0\},$$

where $U$ is an $r \times (r - p)$ full column rank matrix satisfying $U'X = 0$, i.e. the
columns of $U$ form a minimal spanning set, or basis, for the null space of $X'$.
Letting $u = r - p$ and $h^*(\theta) = 0$ be the $q$ constraints implied by $[\Theta]$, we can
write the $(u + q) \times 1$ vector of constraining functions as $h(\theta) = [h_1(\theta), h^*(\theta)]'$,
where $h_1 = U'g$. We rewrite the freedom model $[\Theta_X]$ of (2.2.3) as $[\Theta_h]$, where

$$\Theta_h = \{\theta \in R^s : h(\theta) = 0\}. \qquad (2.2.4)$$

Aitchison and Silvey (1958, 1960) refer to model $[\Theta_h]$ as a constraint model.
Every freedom model can be written as a constraint model.
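
The passage from a freedom model to a constraint model can be carried out numerically. The following is a sketch in Python with NumPy; the function name `null_basis_of_transpose`, the model matrix `X`, and the vectors used to test it are hypothetical. It obtains a matrix $U$ with $U'X = 0$ from the singular value decomposition of $X$, which is one of several possible ways to construct such a basis.

```python
import numpy as np

def null_basis_of_transpose(X, tol=1e-10):
    """Return U (r x (r - rank X)) whose columns span the null space of X'."""
    # Full SVD: the left singular vectors beyond rank(X) are orthogonal to
    # every column of X, so they satisfy U'X = 0.
    left, sing, _ = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(sing > tol * sing.max()))
    return left[:, rank:]

# Hypothetical model matrix: intercept plus one covariate, r = 5, p = 2.
X = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 4.0, 7.0])])
U = null_basis_of_transpose(X)

beta = np.array([1.5, -0.3])
g_in_model = X @ beta                                         # satisfies g = X beta
g_outside = g_in_model + np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # does not
```

Here `U.T @ g_in_model` is zero up to rounding error while `U.T @ g_outside` is not, so the $r - p = 3$ constraint equations pick out exactly the vectors that the freedom equation allows.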

We present a few simple examples to illustrate the equivalence between
the two model formulations, freedom and constraint.

Example 1. Let $Y_i \sim \text{ind } N(\beta, \sigma^2)$, $i = 1, \ldots, n$, where $\sigma^2$ is known.
This model can be specified as the freedom model $[\Theta_X]$, where

$$\Theta_X = \{\mu \in R^n : \mu = 1_n\beta,\ \beta \text{ unknown}\}$$

or equivalently it can be expressed as the constraint model $[\Theta_h]$, where

$$\Theta_h = \{\mu \in R^n : U'\mu = 0\}$$

and $U'$ is the $(n - 1) \times n$ matrix

$$U' = \begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ 1 & 0 & -1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & 0 & \cdots & -1 \end{pmatrix}.$$

It is easily seen that $\Theta_X = \Theta_h$ and that the model degrees of freedom is
$df[\Theta_X] = n - (n - 1) = 1$.
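
Example 1 can be verified numerically for a particular $n$. The following sketch in Python with NumPy takes $n = 6$ (an arbitrary choice) and builds the $(n - 1) \times n$ constraint matrix whose $k$th row contrasts the first mean with the $(k+1)$th; the variable names are hypothetical.

```python
import numpy as np

n = 6
# U' from Example 1: row k is e_1 - e_{k+1}, k = 1, ..., n-1.
Ut = np.hstack([np.ones((n - 1, 1)), -np.eye(n - 1)])

X = np.ones((n, 1))                   # model matrix 1_n of the freedom model
df = n - np.linalg.matrix_rank(Ut)    # parameters minus independent constraints

mu_common = np.full(n, 2.5)           # a mean vector inside the model
mu_other = np.arange(n, dtype=float)  # a mean vector outside the model
```

The product `Ut @ X` is the zero vector, `Ut @ mu_common` is zero while `Ut @ mu_other` is not, and `df` equals 1, in agreement with the count $n - (n - 1)$.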

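Example 2. Let $Y_i \sim \text{ind } N(\mu_i = \beta_0 + \beta_1 x_i, \sigma^2)$, $i = 1, \ldots, n$, where $\sigma^2$
is known. This model can be specified as the freedom model $[\Theta_X]$, where

$$\Theta_X = \{\mu \in R^n : \mu_i = \beta_0 + \beta_1 x_i,\ i = 1, \ldots, n,\ (\beta_0, \beta_1) \text{ unknown}\}$$

or, assuming that each $x_i$ is distinct, as the constraint model $[\Theta_h]$, where

$$\Theta_h = \{\mu \in R^n : U'\mu = 0\}.$$

Here $U'$ is the $(n - 2) \times n$ matrix

$$U' = \begin{pmatrix}
\frac{-1}{x_2 - x_1} & \frac{1}{x_2 - x_1} + \frac{1}{x_3 - x_2} & \frac{-1}{x_3 - x_2} & 0 & \cdots & 0 & 0 \\
\frac{-1}{x_2 - x_1} & \frac{1}{x_2 - x_1} & \frac{1}{x_4 - x_3} & \frac{-1}{x_4 - x_3} & \cdots & 0 & 0 \\
\vdots & \vdots & & & & & \vdots \\
\frac{-1}{x_2 - x_1} & \frac{1}{x_2 - x_1} & 0 & 0 & \cdots & \frac{1}{x_n - x_{n-1}} & \frac{-1}{x_n - x_{n-1}}
\end{pmatrix}.$$

Notice that $U'\mu = 0$ implies that

$$\frac{\mu_{j+1} - \mu_j}{x_{j+1} - x_j} = \frac{\mu_{k+1} - \mu_k}{x_{k+1} - x_k}, \qquad \forall\, k, j.$$

That is, the $n$ means fall on a line. As before, it can be seen that $\Theta_X = \Theta_h$
and that the model degrees of freedom is $df[\Theta_h] = n - (n - 2) = 2$.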
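
The straight-line constraints of Example 2 can also be checked numerically. The following sketch in Python with NumPy assumes that each constraint row equates the slope between the first two points with the slope between a later pair of adjacent points, which is one way of writing the $n - 2$ equal-slope conditions. The function name `slope_constraint_matrix` and the covariate values are hypothetical.

```python
import numpy as np

def slope_constraint_matrix(x):
    """(n-2) x n matrix; row k gives slope(points 1,2) - slope(points k+1,k+2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    Ut = np.zeros((n - 2, n))
    for k in range(n - 2):
        # Slope between the first two points: (mu_2 - mu_1) / (x_2 - x_1).
        Ut[k, 0] -= 1.0 / (x[1] - x[0])
        Ut[k, 1] += 1.0 / (x[1] - x[0])
        # Subtract the slope between the next adjacent pair of points.
        Ut[k, k + 1] += 1.0 / (x[k + 2] - x[k + 1])
        Ut[k, k + 2] -= 1.0 / (x[k + 2] - x[k + 1])
    return Ut

x = np.array([0.0, 1.0, 3.0, 4.0, 8.0])      # distinct covariate values
Ut = slope_constraint_matrix(x)
X = np.column_stack([np.ones(len(x)), x])    # freedom-model matrix (1, x)

df = len(x) - np.linalg.matrix_rank(Ut)      # n - (n - 2)
mu_line = 2.0 + 0.5 * x                      # means on a line
mu_curve = x ** 2                            # means not on a line
```

Both columns of `X` are annihilated by `Ut`, so every mean vector of the form $\beta_0 + \beta_1 x_i$ satisfies the constraints, while the quadratic `mu_curve` does not, and `df` equals 2.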

Definitions. We will assume that the constraining function $h$ satisfies
some reasonable conditions so that the model is well defined. We first present
some definitions.

(1) A model $[\Theta_h]$ is said to be 'consistent' if $\Theta_h \neq \emptyset$.

(2) A consistent model $[\Theta_h]$ is said to be 'well-defined' if the Jacobian
matrix for $h$ is of full row rank $v = u + q$ at every point in $\Theta_h$. That is,

$$\forall\, \theta_0 \in \Theta_h, \quad \left(\frac{\partial h(\theta)}{\partial \theta'}\right)\bigg|_{\theta_0} \text{ is of full row rank } v.$$

(3) A model $[\Theta_h]$ is said to be 'ill-defined' if it is not well-defined, i.e.

$$\exists\, \theta_0 \in \Theta_h \text{ such that } \left(\frac{\partial h(\theta)}{\partial \theta'}\right)\bigg|_{\theta_0} \text{ is not of full row rank } v.$$

(4) An ill-defined model $[\Theta_h]$ is said to be 'inconsistent' or 'incompatible'
if $\Theta_h = \emptyset$.

Briefly,  any  reasonable  model  will  have  a  nonempty  parameter  space  and 
hence  will  be  consistent.  The  Jacobian  condition  of  definition  (2)  is  similar 
to  the  condition  required  in  the  Implicit  Function  Theorem  (see  Bartle,  1976). 
Basically,  this  condition  requires  the  constraints  to  be  nonredundant  so  that, 
at  least  theoretically,  the  constraint  equations  can  be  written  uniquely  as 
a  function  of  a  smaller  set  of  parameters.  An  ill-defined  model  has  been 
specified  with  a  redundant  set  of  constraint  equations.  Using  the  lingo  of 
the  optimization  literature,  two  constraints  are  redundant  if,  for  each  point 
in  the  parameter  space,  both  of  the  constraints  are  'active'  or  both  of  the 
constraints  are  'inactive'.  That  is,  for  all  parameter  values,  if  one  constraint 
is  active  (inactive)  then  the  other  is  necessarily  active  (inactive). 

It  should  be  noted  that  the  above  definitions  are  in  terms  of  the 
constraint  formulation  of  a  model.  This  is  sufficient  since  freedom  models  can 
be  written  as  constraint  models.  For  convenience,  we  give  sufficient  conditions 
for  a  freedom  model  to  be  well-defined. 

A consistent freedom model is well-defined if it satisfies the following two
conditions:

(i) The constraints implied by $g(\theta) = X\beta$ are independent of the $q$
constraints implied by $[\Theta]$.

(ii) The Jacobian matrix of $g$ evaluated at any point in $\Theta_X$ is of full row
rank $r$, i.e.

$$\forall\, \theta_0 \in \Theta_X, \quad \left(\frac{\partial g(\theta)}{\partial \theta'}\right)\bigg|_{\theta_0} \text{ is of full row rank } r.$$

The sufficiency of conditions (i) and (ii) can be seen by observing that
(ii) implies that $h_1 = U'g$ has a full row rank Jacobian since $U'$ is of full row
rank, and (i) implies that $h = (h_1, h^*)'$ has a full row rank Jacobian. These
sufficient conditions are by no means necessary for a model to be well defined,
as the Jacobian of $h$ may be of full row rank $v$ even when the Jacobian of $g$
is not of full row rank.

Notice that the model matrix has nothing to do with whether or not a
model is well defined. In particular, one may think that the model $[\Theta_X]$ is
ill-defined whenever the $r \times p$ matrix $X$ is not of full column rank, i.e. the
freedom parameters are nonestimable. However, the model can be rewritten
as a constraint model with the full column rank matrix $U$ spanning the null
space of $X'$, which then has dimension $r - \mathrm{rank}(X)$, greater than $r - p$. It
follows that if $g$ satisfies (i) and (ii), then the model $[\Theta_X]$ will be
well-defined. The only reason we
have taken $X$ to be of full column rank is to avoid using generalized inverses
when working with the freedom parameters.

To illustrate the use of these definitions, we consider the model $[\Theta_M]$,
where

$$\Theta_M = \{\theta \in R^s : M\theta - d = 0\}.$$

The model will be well defined if $\partial h/\partial \theta' = M$ is of full row rank. It is
inconsistent if the linear system of equations $M\theta = d$ is inconsistent.
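
For a linear constraint model these definitions reduce to rank checks, which can be sketched in Python with NumPy as follows. The function name `classify_linear_model` and the example matrices are hypothetical. Consistency is checked by comparing the rank of $M$ with the rank of the augmented matrix $[M, d]$, and well-definedness by checking that $M$ has full row rank. The sketch reports an empty parameter space as 'inconsistent' and reserves 'ill-defined' for consistent models with redundant constraints.

```python
import numpy as np

def classify_linear_model(M, d):
    """Classify {theta : M theta = d} as 'well-defined', 'ill-defined' or 'inconsistent'."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    d = np.asarray(d, dtype=float).reshape(-1, 1)
    rank_M = np.linalg.matrix_rank(M)
    rank_aug = np.linalg.matrix_rank(np.hstack([M, d]))
    if rank_aug > rank_M:
        return "inconsistent"      # no theta solves M theta = d
    if rank_M == M.shape[0]:
        return "well-defined"      # Jacobian M has full row rank
    return "ill-defined"           # consistent, but constraints are redundant

kind_a = classify_linear_model([[1, -1, 0], [0, 1, -1]], [0, 0])
kind_b = classify_linear_model([[1, -1, 0], [2, -2, 0]], [0, 0])
kind_c = classify_linear_model([[1, -1, 0], [1, -1, 0]], [0, 1])
```

The three calls give a well-defined model, a model with a redundant second constraint, and a model whose two constraints contradict each other.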

If a model $[\Theta_h]$ is well defined, then the constraints implied by the model
are all independent in that no constraint can be implied by the others. We
will consider only well-defined models when calculating degrees of freedom.
As before, we calculate degrees of freedom for a model as the difference
between the number of model parameters $s$ and the number of independent
constraints $v$ implied by the model, i.e.

$$df[\Theta_h] = s - (r - p + q) = s - (u + q) = s - v.$$

Notice that for the constraint model, model degrees of freedom is a decreasing
function of the number of independent constraints $v$.

Finally,  it  should  be  noted  that  models  may  be  specified  in  terms  of 
both  freedom  equations  and  constraint  equations.  In  fact,  in  subsequent 
sections  this  will  be  the  case.  However,  without  loss  of  generality,  we  will 
concentrate  on  constraint  models  since  any  model  can  be  written  in  the  form 
of  a  constraint  model. 

2.2.2     Measuring  Model  Goodness  of  Fit 

Inferences  about  model  parameters  are  reliable  only  if  the  model  is 
'good'.  A  good  model  should  be  well  defined  (or  at  least  consistent).  It 
should  be  simple  and  parsimonious.  Finally,  the  model  should  be  relatively 
close  to  holding. 

To assess whether or not the model holds, we will need the concept of a
distance between two models. To begin, we will assume there is some measure
of distance between two hierarchical parametric models. (Two models $[\Theta_1]$
and $[\Theta_2]$ are hierarchical if $\Theta_2 \subseteq \Theta_1$ and $df[\Theta_2] < df[\Theta_1]$ whenever $\Theta_1 \neq \Theta_2$.)
This (parametric) distance will be a quantitative comparison of how close
the two models are to holding. Thus, if both models hold the distance is
zero. The distance will also be independent of the model degrees of freedom.
Recall that the form of $F(\cdot\,; \theta)$ is assumed known. Therefore, the distance will
measure how far the true parameter is from falling in the parametric model
space. Suppose, firstly, that $\Theta_1$ and $\Theta_2$ are general parameter spaces. That
is, $\theta \in \Theta_1 \cup \Theta_2$ does not necessarily define a probability distribution. In other
words, $\theta$ need not fall in a subset of an $(s - 1)$-dimensional simplex. Let $a(\theta)$
and $b(\theta)$ be vector or matrix valued functions of the unknown parameter $\theta$.
Define a distance between two hierarchical models $[\Theta_1]$ and $[\Theta_2]$ ($\Theta_2 \subseteq \Theta_1$) as

\[
\delta[\Theta_2; \Theta_1] = \inf_{\theta \in \Theta_2} \| b(\theta)(a(\theta) - a(\theta^*)) \|^2 - \inf_{\theta \in \Theta_1} \| b(\theta)(a(\theta) - a(\theta^*)) \|^2 .
\]

Notice that $a$ and $b$ can be chosen so that

(1) $\delta[\Theta_2; \Theta_1] \geq 0$

(2) $\delta[\Theta_2; \Theta_1] = 0$ iff $[\Theta_1]$ and $[\Theta_2]$ hold.
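For example, consider the case $Y \sim MVN_n(\mu, \sigma^2 I_n)$. Suppose that

\[
\begin{aligned}
\Theta &= \{(\mu, \sigma^2) : \mu \in R^n,\ \sigma^2 > 0\} \\
\Theta_2 &= \{(\mu, \sigma^2) : \mu = X\beta,\ \beta \in R^p,\ \sigma^2 > 0\} \\
\Theta_3 &= \{(\mu, \sigma^2) : \mu = 1_n \alpha,\ \alpha \in R,\ \sigma^2 > 0\}.
\end{aligned}
\]

In this example, each component of $Y$ has a common variance $\sigma^2$. It seems
reasonable that differences between any $\mu_j$ and the true mean $\mu_j^*$ are equally
important. Hence, a natural distance between any two of these models is

\[
\delta[\Theta_{M_2}; \Theta_{M_1}] = \inf_{\Theta_{M_2}} \| \mu - \mu^* \|^2 - \inf_{\Theta_{M_1}} \| \mu - \mu^* \|^2 . \qquad (2.2.5)
\]

Notice that $a(\mu, \sigma^2) = \mu$ and $b(\mu, \sigma^2) = 1$. Hence, the measure of distance
between $[\Theta]$ and $[\Theta_1]$ is

\[
\delta[\Theta_1; \Theta] = \inf_{\Theta_1} \| \mu - \mu^* \|^2 = \| \mu_0 - \mu^* \|^2 .
\]

The second infimum is zero since the model $[\Theta]$ is known to hold.
The measure of distance between $[\Theta_2]$ and $[\Theta]$ is

\[
\begin{aligned}
\delta[\Theta_2; \Theta] &= \inf_{\Theta_2} \| \mu - \mu^* \|^2 = \inf_{\beta} \| X\beta - \mu^* \|^2 \\
&= \| X(X'X)^{-1}X'\mu^* - \mu^* \|^2 \\
&= \| (I_n - X(X'X)^{-1}X')\mu^* \|^2 \\
&= \mu^{*\prime}(I_n - X(X'X)^{-1}X')\mu^* .
\end{aligned}
\]

This is the squared length of the vector orthogonal to the projection of $\mu^*$
onto the range space of $X$. Notice that if $\mu^* = X\beta^*$, that is, $[\Theta_2]$ holds, then
$\delta[\Theta_2; \Theta] = 0$.

Finally, the distance between $[\Theta_3]$ and $[\Theta_2]$ is

\[
\begin{aligned}
\delta[\Theta_3; \Theta_2] &= \inf_{\Theta_3} \| \mu - \mu^* \|^2 - \inf_{\Theta_2} \| \mu - \mu^* \|^2 \\
&= \mu^{*\prime}\left(I_n - \tfrac{1}{n} 1_n 1_n'\right)\mu^* - \mu^{*\prime}(I_n - X(X'X)^{-1}X')\mu^* \qquad (2.2.6) \\
&= \mu^{*\prime}\left(X(X'X)^{-1}X' - \tfrac{1}{n} 1_n 1_n'\right)\mu^* .
\end{aligned}
\]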
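The projection formulas for the normal example can be checked numerically. The sketch below is illustrative only and assumes NumPy; the names `projection` and `normal_model_distances`, and the small design matrix in the usage lines, are hypothetical. It assumes $X$ has full column rank and that $1_n$ lies in the range space of $X$, so that the models are hierarchical.

```python
import numpy as np

def projection(X):
    """Orthogonal projection matrix X (X'X)^{-1} X' onto the range space of X."""
    X = np.asarray(X, dtype=float)
    return X @ np.linalg.solve(X.T @ X, X.T)

def normal_model_distances(mu_star, X):
    """Return (delta[Theta_2; Theta], delta[Theta_3; Theta_2]) for a given true mean."""
    mu_star = np.asarray(mu_star, dtype=float)
    n = mu_star.shape[0]
    P_X = projection(X)
    P_1 = np.full((n, n), 1.0 / n)          # projection onto span(1_n)
    I_n = np.eye(n)
    d2 = float(mu_star @ (I_n - P_X) @ mu_star)     # mu*'(I - P_X) mu*
    d32 = float(mu_star @ (P_X - P_1) @ mu_star)    # mu*'(P_X - P_1) mu*
    return d2, d32

x = np.arange(4.0)
X = np.column_stack([np.ones(4), x])
# A true mean that is linear in x: [Theta_2] holds, [Theta_3] does not.
d2_lin, d32_lin = normal_model_distances(1.0 + 2.0 * x, X)
# A constant true mean: both [Theta_2] and [Theta_3] hold.
d2_const, d32_const = normal_model_distances(np.full(4, 5.0), X)
```

With a linear true mean the first distance vanishes while the second equals the sum of squared deviations of $\mu^*$ about its average; with a constant true mean both distances vanish.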

As another example, consider a random vector $Y = (Y_1, \ldots, Y_n)'$, with
independent components following an exponential dispersion distribution
(Jørgensen, 1989). That is,

\[
Y_i \sim \text{indep } ED(\mu_i, \sigma^2), \quad i = 1, \ldots, n,
\]

where the density of $Y_i$, with respect to some measure, has form

\[
f_{Y_i}(y; \gamma_i, \sigma^2) = a(y, \sigma^2) \exp\left\{ \frac{1}{\sigma^2} \left[ y\gamma_i - \kappa(\gamma_i) \right] \right\} \qquad (2.2.7)
\]

where $\mu_i = \kappa'(\gamma_i)$ and $\mathrm{var}(Y_i) = \sigma^2 \kappa''(\gamma_i)$. Let $V(\mu) = \oplus_i \kappa''(\gamma_i)$ and
$\theta = (\mu_1, \ldots, \mu_n, \sigma^2)'$. Since the components of $Y$ have different variances,
a natural measure of distance is

\[
\delta[\Theta_{M_2}; \Theta_{M_1}] = \inf_{\Theta_{M_2}} \| V(\mu)^{-1/2}(\mu - \mu^*) \|^2 - \inf_{\Theta_{M_1}} \| V(\mu)^{-1/2}(\mu - \mu^*) \|^2 . \qquad (2.2.8)
\]

That is, $a(\theta) = \mu$ and $b(\theta) = V(\mu)^{-1/2}$. Premultiplying the vector $(\mu - \mu^*)$ by
$V(\mu)^{-1/2}$ has the effect of downplaying those differences $(\mu_i - \mu_i^*)$ for which the
corresponding variance is large.

To assess the goodness of fit of a model, relative to another, we can
estimate the distance $\delta$ via some statistic based on the observed data. It
is interesting to note that when $\delta = 0$, i.e. both models hold, our data-based
estimate of this null distance will be some nonnegative (positive, if
the model is unsaturated) number, reflecting the amount of white noise or
random variability there is in $Y$. This is so because, if both models hold,
then the only reason that our estimate of distance would be nonzero would
be because $Y$ has some random component. That is, the variability in $Y$ that
is not explained by the model causes the data to fit the model imperfectly.

Let $D$ be an estimate of $\delta$. That is, $D[\Theta_2; \Theta_1]$ is a stochastic, data-based
estimate of how far apart models $[\Theta_1]$ and $[\Theta_2]$ are. Potential candidates
for $D$ are the weighted least squares, likelihood ratio, Wald, deviance, and
Lagrange multiplier statistics.

For example, consider the n-variate normal case and the four candidate
models $[\Theta]$, $[\Theta_1]$, $[\Theta_2]$, and $[\Theta_3]$. We will assume that both $[\Theta]$ and $[\Theta_2]$
hold. In view of (2.2.5) a reasonable estimate of $\delta[\Theta_2; \Theta]$ can be obtained by
replacing $\mu^*$ by $Y$, the estimate of $\mu^*$ under model $[\Theta]$, i.e.

\[
D[\Theta_2; \Theta] = Y'(I_n - X(X'X)^{-1}X')Y = \sum_{i=1}^{n} (Y_i - \hat{\mu}_i)^2 .
\]

Recall that since $\delta[\Theta_2; \Theta]$ is known to be zero, $D[\Theta_2; \Theta]$ serves as our
'estimate of error'.

Similarly, a reasonable estimate of $\delta[\Theta_3; \Theta_2]$ can be obtained by replacing
$\mu^*$ in (2.2.6) by $Y$, the least restrictive estimate of $\mu^*$, i.e.

\[
D[\Theta_3; \Theta_2] = Y'\left(X(X'X)^{-1}X' - \tfrac{1}{n} 1_n 1_n'\right)Y = \sum_{i=1}^{n} (\hat{\mu}_i - \bar{Y})^2 .
\]

Now $\Theta_3 \subset \Theta_2$ and

\[
\begin{aligned}
df[\Theta_3] &= n + 1 - (n - 1) = 2 \\
df[\Theta_2] &= n + 1 - (n - p) = p + 1 .
\end{aligned}
\]
The degrees of freedom associated with estimating the distance between
two models will be called the distance (or residual or goodness-of-fit) degrees
of freedom. The distance degrees of freedom for the two models $[\Theta_{M_1}]$ and
$[\Theta_{M_2}]$ is defined to be the difference between the two model degrees of freedom,
i.e.

\[
df(\delta[\Theta_{M_2}; \Theta_{M_1}]) = df[\Theta_{M_1}] - df[\Theta_{M_2}] .
\]

The number of distance degrees of freedom measures the dimensional distance
between the two models, i.e. the difference in dimensions. It measures the
difference in the amount of freedom one has for estimating $\theta$ for the two
models. It seems intuitive that if the degrees of freedom is large, that is, the
dimensional difference between the two models is great, the significance of the
distance statistic may be difficult to ascertain. This follows since we expect
the fit to be quite different for the two very different models, even when both
models hold. This is a reflection of both white noise and possibly lack of fit.
Therefore, the distance statistic will tend to be large, even when both models
hold. But for many statistics, a large mean implies a large variance, thereby
making significant findings more difficult. It is for this reason that we say
it is better to concentrate our efforts on relatively few degrees of freedom
to detect lack of fit. That is, one should use the smallest alternative space
possible when testing a null hypothesis.

A  more  technical  argument  holds  when  the  test  statistic  (distance 
statistic)  is  a  Chi-square  or  an  F.  Das  Gupta  and  Perlman  (1974)  showed 
that  for  a  fixed  noncentrality  parameter,  i.e.  fixed  distance  between  models, 
the  power  of  the  F-test  or  the  Chi-square  test  increases  as  the  distance  degrees 
of  freedom  decreases. 

Example 1: Continuing with the n-variate normal example, we see that

\[
df(\delta[\Theta_3; \Theta_2]) = df[\Theta_2] - df[\Theta_3] = (p + 1) - 2 = p - 1 .
\]

Thus, $\Theta_3$ has $p - 1$ fewer dimensions than $\Theta_2$. Now, if we knew $\sigma^2$, the white
noise variance, we could test $H_0 : \theta^* \in \Theta_3$ vs. $H_1 : \theta^* \in \Theta_2 - \Theta_3$, using the
statistic

\[
\frac{D[\Theta_3; \Theta_2]}{\sigma^2} = \frac{SS(\mathrm{Reg})}{\sigma^2}, \qquad (2.2.9)
\]

which has a $\chi^2(p - 1)$ null distribution. However, $\sigma^2$ is not generally known and
we must estimate it. One way of estimating $\sigma^2$ is by estimating the distance
between $[\Theta]$ and $[\Theta_2]$, two models that are known to hold, and dividing by
the distance degrees of freedom. Since the distance degrees of freedom is
$df[\Theta] - df[\Theta_2] = n + 1 - (p + 1) = n - p$, we have that the estimate of the white
noise variance is $D[\Theta_2; \Theta]/(n - p) = SS(\mathrm{Error})/(n - p)$.
Notice that in the above example the estimate of the parameter $\sigma^2$
was simply the estimated distance between two models that were known to
hold divided by their dimensional distance. Quite generally, when the data
have an exponential dispersion distribution (2.2.7) with common dispersion
parameter $\sigma^2$, the estimated distance between two models that are known to
hold, divided by their dimensional distance, gives us an estimate of $\sigma^2$. This is
true when the estimated distance is taken to be the LR, Wald, Deviance, LM,
or the weighted least squares statistics. These statistics are natural estimators
of the weighted distance given in (2.2.8) for the exponential dispersion models.
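The quantities in Example 1 can be computed directly from data. The sketch below is illustrative and assumes NumPy; the function name `anova_distances` and the toy data are hypothetical. The returned `f_stat` is the usual ratio of $D[\Theta_3; \Theta_2]/(p-1)$ to the estimated white noise variance, which is one natural way (an assumption here, not spelled out above) to replace the unknown $\sigma^2$ in (2.2.9).

```python
import numpy as np

def anova_distances(y, X):
    """Estimated distances for the normal example; X must contain a column of ones."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]    # mu_hat under [Theta_2]
    ss_error = float(np.sum((y - fitted) ** 2))          # D[Theta_2; Theta]
    ss_reg = float(np.sum((fitted - y.mean()) ** 2))     # D[Theta_3; Theta_2]
    sigma2_hat = ss_error / (n - p)                      # SS(Error) / (n - p)
    f_stat = (ss_reg / (p - 1)) / sigma2_hat
    return {"ss_error": ss_error, "ss_reg": ss_reg,
            "sigma2_hat": sigma2_hat, "f_stat": f_stat}

x = np.arange(5.0)
X = np.column_stack([np.ones(5), x])
res = anova_distances([1.1, 2.9, 5.2, 6.8, 9.0], X)
```

For these toy data the two estimated distances add up to the total sum of squares about the mean, as the projection identities imply.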

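Now, let us assume that $\Theta_1$ and $\Theta_2$ are each subsets of an
$(s - 1)$-dimensional simplex. For example, with count data, conditional on the
total $n$, the distribution is often multinomial with index $n$ and parameter
(alternatively, probability distribution vector) $\theta^*$. Read and Cressie (1988)
extensively study a family of distance measures called the power-divergence
family. The power divergences have form

\[
I^{\lambda}(\theta^*, \theta) = \frac{1}{\lambda(\lambda + 1)} \sum_{i} \theta_i^* \left[ \left( \frac{\theta_i^*}{\theta_i} \right)^{\lambda} - 1 \right],
\]

where $I^{0}$ and $I^{-1}$ are defined to be the continuous limiting values as $\lambda \to 0$ and
$\lambda \to -1$. It is assumed that $\theta^*$ and $\theta$ fall on an $(s - 1)$-dimensional simplex.
As usual, let $\theta^*$ represent the true unknown parameter. We define the family
of distance measures between $[\Theta_1]$ and $[\Theta_2]$ ($\Theta_2 \subseteq \Theta_1$) to be proportional to

\[
\delta[\Theta_2; \Theta_1] = 2n \left\{ \inf_{\Theta_2} I^{\lambda}(\theta^*, \theta) - \inf_{\Theta_1} I^{\lambda}(\theta^*, \theta) \right\} .
\]

By properties of $I^{\lambda}(\theta^*, \theta)$ (Read and Cressie, 1988, pp. 110-113), it follows
that $\delta \geq 0$, with equality if and only if both models hold.

To estimate $\delta[\Theta_2; \Theta_1]$ based on the data, we note that our least restrictive
guess of $\theta^*$ is $Y/n$, the vector of sample proportions. Intuitively, a good
estimate of the quantity $\delta[\Theta_2; \Theta_1]$ would be

\[
\begin{aligned}
D[\Theta_2; \Theta_1] &= 2n \left\{ \inf_{\Theta_2} I^{\lambda}(Y/n, \theta) - \inf_{\Theta_1} I^{\lambda}(Y/n, \theta) \right\} \\
&= \frac{2}{\lambda(\lambda + 1)} \sum_i Y_i \left[ \left( \frac{Y_i}{n \hat{\theta}^{(\lambda)}_{2i}} \right)^{\lambda} - 1 \right]
 - \frac{2}{\lambda(\lambda + 1)} \sum_i Y_i \left[ \left( \frac{Y_i}{n \hat{\theta}^{(\lambda)}_{1i}} \right)^{\lambda} - 1 \right],
\end{aligned}
\]

where $\hat{\theta}^{(\lambda)}_1$ and $\hat{\theta}^{(\lambda)}_2$ are the 'minimum divergence' estimators obtained by
minimizing $I^{\lambda}(Y/n, \theta)$ with respect to $\theta$ over $\Theta_1$ and $\Theta_2$ respectively. Read
and Cressie (1988) point out that $D[\Theta_2; \Theta_1]$ is equal to the likelihood ratio
statistic when $\lambda = 0$. Also, if we assume that $[\Theta_1]$ holds so that the second
infimum is zero, we have that, for $\lambda = 1$,

\[
D[\Theta_2; \Theta_1] = \sum_i \frac{(Y_i - n \hat{\theta}^{(1)}_{2i})^2}{n \hat{\theta}^{(1)}_{2i}},
\]

which is asymptotically equivalent to

\[
X^2 = \sum_i \frac{(Y_i - n \hat{\theta}^{(0)}_{i})^2}{n \hat{\theta}^{(0)}_{i}},
\]

where $\hat{\theta}^{(0)}$ is the maximum likelihood estimator of $\theta^*$ over the space $\Theta_2$. This
is the Pearson chi-square statistic. Other asymptotically equivalent distance
estimates are the Wald statistic and the Lagrangian multiplier statistic. We
now illustrate these results via examples.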
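The power-divergence statistic for a given set of fitted means can be sketched as follows. This is an illustrative sketch in plain Python; the function name `power_divergence` and the toy counts are hypothetical, and the fitted means `m` are taken as given (no minimization over a model space is attempted). It assumes `sum(y) == sum(m)`, and positive counts for the limiting case $\lambda = -1$.

```python
import math

def power_divergence(y, m, lam):
    """Return 2n * I^lam(y/n, m/n) for counts y and fitted means m (sum(y) == sum(m))."""
    if abs(lam) < 1e-12:
        # Limit as lam -> 0: the likelihood ratio statistic G^2.
        return 2.0 * sum(yi * math.log(yi / mi) for yi, mi in zip(y, m) if yi > 0)
    if abs(lam + 1.0) < 1e-12:
        # Limit as lam -> -1 (requires every count to be positive).
        return 2.0 * sum(mi * math.log(mi / yi) for yi, mi in zip(y, m))
    total = sum(yi * ((yi / mi) ** lam - 1.0) for yi, mi in zip(y, m) if yi > 0)
    return 2.0 / (lam * (lam + 1.0)) * total

y = [30, 20, 50]
m = [25.0, 25.0, 50.0]
pearson = power_divergence(y, m, 1.0)      # lam = 1 gives the Pearson form
g2 = power_divergence(y, m, 0.0)           # lam = 0 gives the likelihood ratio form
near_zero = power_divergence(y, m, 1e-6)   # close to g2 by continuity in lam
```

For $\lambda = 1$ the general expression reduces algebraically to $\sum_i (y_i - m_i)^2 / m_i$ whenever the fitted means sum to $n$, which is why the two ways of writing the Pearson statistic agree.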

Example 2: Suppose that $Y = (Y_{11}, Y_{12}, Y_{21}, Y_{22})'$ is a multinomial vector.
That is,

\[
(Y_{11}, Y_{12}, Y_{21}, Y_{22})' \sim \mathrm{Mult}(n, (\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})'), \quad \text{with } \sum_i \sum_j \pi_{ij} = 1 .
\]

Thus, the model that is known to contain the true parameter vector $\pi^*$ is $[\Theta]$,
where

\[
\Theta = \{\pi : \pi' 1_4 = 1,\ \pi_{ij} \in (0, 1),\ i, j = 1, 2\} .
\]

Notice that $\Theta$ is really a 3-dimensional subset (simplex) of $(0, 1)^4$ so that
$df[\Theta] = 4 - 1 = 3$.

We wish to test the independence hypotheses

\[
\begin{cases}
H_0 : \ \pi_{11}\pi_{22} = \pi_{12}\pi_{21}, & \text{vs.} \\
H_1 : \ \pi_{11}\pi_{22} \neq \pi_{12}\pi_{21} . &
\end{cases}
\]

Writing the model of interest $[\Theta_0]$ as

\[
\begin{aligned}
\Theta_0 &= \{\pi \in \Theta : \pi_{11}\pi_{22} - \pi_{12}\pi_{21} = 0\} \\
&= \{\pi : \pi' 1_4 = 1,\ \pi_{11}\pi_{22} - \pi_{12}\pi_{21} = 0\},
\end{aligned}
\]

we can state the independence hypotheses as

\[
\begin{cases}
H_0 : \ \pi \in \Theta_0, & \text{vs.} \\
H_1 : \ \pi \in \Theta - \Theta_0 . &
\end{cases}
\]

Now, the model degrees of freedom can be found by subtracting the number
of constraints implied by $[\Theta_0]$ from the total number of parameters, which
is 4. Hence, $df[\Theta_0] = 4 - 2 = 2$. Thus, the distance degrees of freedom, or
measure of dimensional distance, is $df(\delta[\Theta_0; \Theta]) = 3 - 2 = 1$.

Two distance (goodness-of-fit) statistics commonly used are the Pearson
chi-square $X^2$ ($\lambda = 1$) and the likelihood ratio statistic $G^2$ ($\lambda = 0$). The forms
of these two statistics are

\[
D[\Theta_0; \Theta] = X^2 = \sum_i \sum_j \frac{(Y_{ij} - n\hat{\pi}_{ij,0})^2}{n\hat{\pi}_{ij,0}}
\]

and

\[
D[\Theta_0; \Theta] = G^2 = 2 \sum_i \sum_j Y_{ij} \log\left( \frac{Y_{ij}}{n\hat{\pi}_{ij,0}} \right),
\]

where $\hat{\pi}_{ij,0}$ is the ML estimate of $\pi_{ij}$ assuming that model $[\Theta_0]$ holds.

Under the null hypothesis, i.e. if independence truly holds, then the
asymptotic distribution of both distance statistics, $X^2$ and $G^2$, is $\chi^2(1)$.
Example 3: Continuing with Example 2, consider the model $[\Theta_{MH}]$ where

\[
\Theta_{MH} = \{\pi : \pi' 1_4 = 1,\ \pi_{1+} - \pi_{+1} = 0\} .
\]

This model implies that there is marginal homogeneity, i.e. the marginal
distributions for both factors are the same.
We would like to test the hypotheses

\[
\begin{cases}
H_0 : \ \pi \in \Theta_{MH}, & \text{vs.} \\
H_1 : \ \pi \in \Theta - \Theta_{MH} . &
\end{cases}
\]

The model degrees of freedom is $df[\Theta_{MH}] = 4 - 2 = 2$, and so the distance
degrees of freedom is $df(\delta[\Theta_{MH}; \Theta]) = 3 - 2 = 1$. Once again, to illustrate
what model degrees of freedom means, we observe that if $[\Theta_{MH}]$ holds and
we specify two of the four probabilities, the remaining two are completely
determined. Thus, we are free to estimate two of the probabilities based on
the data. The other two are determined.

Two frequently used estimates of the model distance, or model goodness
of fit, are the likelihood ratio statistic $G^2$ and the McNemar statistic $M^2$. For
2 x 2 tables, the McNemar statistic and the Lagrange Multiplier statistic are
equivalent since both are score statistics (Agresti, 1990; Aitchison & Silvey,
1958). The statistics take the following forms

\[
D[\Theta_{MH}; \Theta] = G^2 = 2 \sum_i \sum_j y_{ij} \log\left( \frac{y_{ij}}{n\hat{\pi}_{ij,0}} \right),
\]

and

\[
D[\Theta_{MH}; \Theta] = M^2 = \frac{(y_{12} - y_{21})^2}{y_{12} + y_{21}},
\]

where the $\hat{\pi}_{ij,0}$ in the first expression is the ML estimate of $\pi_{ij}$ under the
model $[\Theta_{MH}]$.
Under the null, i.e. when the marginal distributions are homogeneous,
both of these statistics have asymptotic $\chi^2(1)$ distributions.

It is important to note that, had the constraint $\pi_{2+} - \pi_{+2} = 0$ been added,
the model would remain consistent but would be ill defined. For 2 x 2 tables,
this additional constraint is exactly the same as the constraint $\pi_{1+} - \pi_{+1} = 0$.
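The goodness-of-fit statistics of Examples 2 and 3 can be sketched for a single 2 x 2 table. The code below is illustrative plain Python; the function names and the toy table are hypothetical. Two standard facts are assumed: under independence the ML fitted count for cell $(i, j)$ is (row total)(column total)$/n$, and for a 2 x 2 table marginal homogeneity is equivalent to $\pi_{12} = \pi_{21}$, so the ML fit leaves the diagonal cells unchanged and averages the two off-diagonal counts.

```python
import math

def independence_stats(y11, y12, y21, y22):
    """Pearson X^2 and likelihood ratio G^2 for the independence model [Theta_0]."""
    n = y11 + y12 + y21 + y22
    rows = (y11 + y12, y21 + y22)
    cols = (y11 + y21, y12 + y22)
    obs = ((y11, y12), (y21, y22))
    x2 = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            fit = rows[i] * cols[j] / n          # n * pi_hat_ij under independence
            x2 += (obs[i][j] - fit) ** 2 / fit
            if obs[i][j] > 0:
                g2 += 2.0 * obs[i][j] * math.log(obs[i][j] / fit)
    return x2, g2

def marginal_homogeneity_stats(y11, y12, y21, y22):
    """Likelihood ratio G^2 and McNemar M^2 for [Theta_MH] in a 2 x 2 table."""
    off = y12 + y21
    m2 = (y12 - y21) ** 2 / off
    g2 = 0.0
    for y in (y12, y21):
        if y > 0:
            # Fitted off-diagonal count is off / 2; diagonal cells contribute zero.
            g2 += 2.0 * y * math.log(2.0 * y / off)
    return g2, m2

x2_ind, g2_ind = independence_stats(10, 20, 30, 40)
g2_mh, m2_mh = marginal_homogeneity_stats(10, 20, 30, 40)
```

Each statistic would be referred to a $\chi^2(1)$ distribution, in line with the single distance degree of freedom computed in both examples.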

2.3     Multivariate  Polytomous  Response  Model  Fitting 

In  this  section,  we  describe  ML  model  fitting  for  an  integer  valued 
random  vector  Y  that  is  assumed  to  be  distributed  product-multinomially. 
We  also  investigate  the  asymptotic  behavior  of  the  ML  estimators  within  the 
framework  of  constraint  models.  The  models  we  will  consider  have  form 

\[
\Theta_X = \{\xi \in \Theta : C \log(A e^{\xi}) = X\beta,\ L e^{\xi} = 0\}
\]

or equivalently, for appropriately chosen $U$,

\[
\Theta_X = \Theta_h = \{\xi \in \Theta : U'C \log(A e^{\xi}) = 0,\ L e^{\xi} = 0\},
\]

where $e^{\xi}$ is the $s \times 1$ mean vector of $Y$, a product-multinomial random vector,
and the model parameter space $\Theta$ is of dimension $s - q$, where $q$ is the number
of identifiability constraints. We use the parameter $\xi$ rather than $\mu = e^{\xi}$
for several reasons. One reason will become evident when we explore the
asymptotic behavior of the ML estimator of $\xi$. It turns out that the random
variable $\hat{\mu} - \mu_0$ is not bounded in probability, whereas $\hat{\xi} - \xi_0$ is. In fact, the
random variable $\hat{\xi} - \xi_0$ converges in probability to 0. Another reason for using
$\xi$ rather than $\mu$ is that the procedure for deriving the maximum likelihood
estimate of $\xi$ is less sensitive to small (or zero) counts. The range of possible
$\xi$ values is the whole real line, while the range of possible $\mu$ values is restricted
to the positive half of the real line. By using $\xi$ the problem of intermediate
out of range values (e.g. negative cell mean estimates) is avoided.

As stated above, we initially assume that the vector of cell counts $Y$
has a product-multinomial distribution. This is not overly restrictive since it
will be shown that inferences based on maximum (multinomial) likelihood
estimates are often the same as inferences based on maximum (Poisson)
likelihood estimates. We will present some results in section 2.4 that allow
us to determine when these inferences are indeed the same.

We also consider an alternative method for computing the maximum
likelihood estimators and their asymptotic covariances. The method of
Lagrange undetermined multipliers is well suited for maximum likelihood
fitting of the models we will be considering. This is so because we will specify
the models in terms of constraint equations and the fitting problem will be
one of maximizing a function, namely the log likelihood, subject to some
constraints, namely that $\xi \in \Theta_h$.


2.3.1     A  General  Multinomial  Response  Model 


In  this  section  we  specify  a  class  of  models  that  is  directly  applicable 
to  Chapter  3  of  this  dissertation.  Specifically,  the  models  will  be  specified  in 
such  a  way  so  as  to  include  the  class  of  simultaneous  models  for  the  joint  and 
marginal distributions considered in Chapter 3.
Let the random vector $Y = \mathrm{vec}(Y_1, \ldots, Y_K)$ denote a product multinomial
random vector, i.e.

\[
Y_i = (Y_{i1}, \ldots, Y_{iR})' \sim \text{ind Mult}(n_i, \pi_i), \quad i = 1, \ldots, K, \quad K \geq 1,
\]

where the $R \times 1$ vectors of cell probabilities satisfy $\pi_i' 1_R = 1$, $i = 1, \ldots, K$.

Consider the 1:1 reparameterization from $\{\pi_i\}$ to $\{\xi_i\}$, where $\xi_i =
\log(\mu_i) = \log(n_i \pi_i)$ is an $R \times 1$ vector of log means. Under this
parameterization,

\[
Y_i \sim \text{ind Mult}\left(n_i, \frac{e^{\xi_i}}{n_i}\right), \quad e^{\xi_i'} 1_R = n_i, \quad i = 1, \ldots, K,
\]

or

\[
Y_i \sim \text{ind Mult}\left(n_i, \frac{e^{\xi_i}}{n_i}\right), \quad i = 1, \ldots, K, \quad e^{\xi'}(\oplus_1^K 1_R) = n', \qquad (2.3.1)
\]

where $n' = (n_1, \ldots, n_K)$ is the $1 \times K$ vector of multinomial indices.

The kernel of the log likelihood for $Y$, written as a function of $\xi$, is

\[
\ell^{(M)}(\xi; y) = y'\xi, \quad e^{\xi'}(\oplus_1^K 1_R) = n' . \qquad (2.3.2)
\]

We now posit a model for $\xi$, the vector of log means. Let $s = RK$ be the
total number of cell means. Our objectives are to test the model goodness
of fit and to estimate the $s \times 1$ model parameter vector $\xi$ as well as any
freedom parameters of interest. It will be assumed that the model $[\Theta_X]$ can
be specified as

\[
\Theta_X = \{\xi \in R^s : C_1 \log A_1 e^{\xi} = X_1\beta_1,\ C_2 \log A_2 e^{\xi} = X_2\beta_2,\ L e^{\xi} = 0,\ e^{\xi'}(\oplus_1^K 1_R) = n'\}, \qquad (2.3.3)
\]

where

\[
\begin{aligned}
& C_i = \oplus_{j=1}^{K} C_{ij}, \quad C_{ij} = C_{i1} \text{ is } q_i \times m_i, \quad i = 1, 2 \\
& A_i = \oplus_{j=1}^{K} A_{ij}, \quad A_{ij} = A_{i1} \text{ is } m_i \times R, \quad i = 1, 2 \\
& L = \oplus_{j=1}^{K} L_j, \quad L_j = L_1 \text{ is } d \times R \\
& \xi = \mathrm{vec}(\xi_1, \ldots, \xi_K), \text{ and } \xi_k \text{ is } R \times 1 \\
& X_i \text{ is } K q_i \times p_i \text{ of full rank } p_i, \quad i = 1, 2 \\
& n \text{ is the } K \times 1 \text{ vector of multinomial indices} \\
& s = RK, \text{ the total number of cells.}
\end{aligned}
\]
Let us say that a model that can be specified as in (2.3.3) satisfies
assumption (A1). That is,

(A1) The multinomial response model can be specified as in (2.3.3).

Notice that the $K$ matrices of $C_i$ are all identical, likewise with the
matrices comprising $A_i$ and $L$. This requires that the model does not change
across the $K$ populations ($K$ multinomials). Also, the two sets of freedom
equations in (2.3.3) will allow us to use two different types of models for
the expected cell means. This provides us with enough generality to fit
many interesting models. For example, we may wish to simultaneously fit
a linear-by-linear association loglinear model for the joint distribution and a
cumulative logit model for the marginal distributions.

We can conveniently rewrite (2.3.3) as

\[
\Theta_X = \{\xi \in R^s : C \log(A e^{\xi}) = X\beta,\ L e^{\xi} = 0,\ e^{\xi'}(\oplus_1^K 1_R) = n'\}, \qquad (2.3.4)
\]

where $A' = [A_1', A_2']$, $C = C_1 \oplus C_2$, $X = X_1 \oplus X_2$, and $\beta = \mathrm{vec}(\beta_1, \beta_2)$.

Notice that the model $[\Theta_X]$ is specified in terms of both freedom
equations and constraint equations. We will rewrite $[\Theta_X]$ as a constraint
model, keeping in the back of our minds that the freedom parameters may be
of interest also.

Let $U$ be a $K(q_1 + q_2) \times u$ matrix of full column rank $u$ such that
$U'X = 0$. Here $u$ is the dimension of the null space of $X'$, $\mathcal{N}(X')$, i.e.
$K(q_1 + q_2) - (p_1 + p_2)$. Since $U$ can be chosen to be of full column rank, it
follows that the columns of $U$ form a basis for the null space of $X'$. Thus, the
range space of $U$ equals the null space of $X'$, i.e. $\mathcal{M}(U) = \mathcal{N}(X')$. Multiplying
the right and left hand sides of the freedom equation $C \log(A e^{\xi}) = X\beta$ by $U'$,
we can rewrite (2.3.4) as

\[
\Theta_h = \{\xi \in R^s : U'C \log(A e^{\xi}) = 0,\ L e^{\xi} = 0,\ e^{\xi'}(\oplus_1^K 1_R) - n' = 0\} . \qquad (2.3.5)
\]

Thus, $\Theta_X = \Theta_h$ and the models $[\Theta_X]$ and $[\Theta_h]$ are one and the same.
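One way to obtain a matrix $U$ with $U'X = 0$ is from a singular value decomposition of $X$. The sketch below is illustrative and assumes NumPy; the function name `constraint_basis`, the tolerance, and the small design matrix are hypothetical, and the original text does not prescribe how $U$ should be computed. Any full column rank matrix whose columns span $\mathcal{N}(X')$ would serve equally well.

```python
import numpy as np

def constraint_basis(X, tol=1e-10):
    """Return U of full column rank whose columns span the null space of X'."""
    X = np.asarray(X, dtype=float)
    # Full SVD: X = W diag(sv) V'. The columns of W beyond rank(X) are
    # orthogonal to the range space of X, so they span the null space of X'.
    W, sv, _ = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(sv > tol * sv.max())) if sv.size else 0
    return W[:, rank:]

# Hypothetical 4 x 2 model matrix (intercept and a linear score).
X = np.column_stack([np.ones(4), np.arange(4.0)])
U = constraint_basis(X)

# Any eta = X beta satisfies the constraint form U' eta = 0.
eta = X @ np.array([0.3, -1.2])
residual = U.T @ eta
```

Here $u = 4 - 2 = 2$, matching the count of rows of $X$ minus its rank; the freedom equation $\eta = X\beta$ and the constraint equation $U'\eta = 0$ describe the same set.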

At this point, we will assume that the constraints implied by the model
$[\Theta_h]$ are nonredundant so that the model is well defined. More specifically, let
$h'(\xi) = [(U'C \log(A e^{\xi}))',\ e^{\xi'}L']$ be the $1 \times (u + l)$ ($l = Kd$) vector of constraint
functions. We will assume that the $u + l + K$ constraints implied by $h(\xi) = 0$
and $e^{\xi'}(\oplus_1^K 1_R) = n'$ are nonredundant. Notice that the constraints in $h(\xi) = 0$
do not include the identifiability constraints. We treat the identifiability
constraints separately for reasons that will become apparent when we actually
fit the models.

As stated previously, one of our primary objectives is to estimate the
model parameters $\xi$ and the freedom parameters $\beta$ under the assumption
that $[\Theta_X]$ (and $[\Theta_h]$) holds. We will use the maximum likelihood estimates,
which can be found by maximizing the log likelihood of $Y$ subject to the
constraint that $[\Theta_h]$ holds.
The (kernel of the) log likelihood under the product multinomial
assumption is shown in (2.3.2). It is

\[
\ell^{(M)}(\xi; y) = y'\xi .
\]

Thus, we are to maximize the function $\ell^{(M)}(\xi; y) = y'\xi$ subject to $\xi \in \Theta_h$.

2.3.2     Maximum  Likelihood  Estimation 

In this section we will discuss two procedurally different approaches
to maximizing the log likelihood $\ell^{(M)}(\xi; y)$ subject to $\xi \in \Theta_h$. The first
approach,  which  is  the  more  commonly  used  approach,  requires  that  the 
model  be  specified  entirely  in  terms  of  freedom  equations.  Often  times, 
when  there  are  no  identifiability  constraints,  the  model  can  be  completely 
specified  as  a  freedom  model.  Models  amenable  to  this  approach  include  the 
Poisson  loglinear  model  and  the  Normal  linear  model.  The  second  approach, 
Lagrange's  method  of  undetermined  multipliers,  can  be  directly  applied  when 
the  model  is  specified  completely  in  terms  of  constraint  equations.  Since  the 
product  multinomial  model  includes  identifiability  constraints,  it  can  more 
easily  be  specified  in  terms  of  constraint  equations.  For  this  reason  this 
second  method  is  the  preferred  choice.  In  the  following  sections,  we  discuss 
some  additional  features  of  these  two  methods. 

Freedom Parameter Approach. One approach often used in simple situations,
namely those situations when the model can be specified completely
in terms of freedom equations, is to write the parameter $\xi$ as a function
of the freedom parameter $\beta$ and maximize $\ell^{(M)}(\xi(\beta); y)$ with respect to $\beta$.
The vector $\xi(\hat{\beta})$ will be in the model space, since the model was specified
completely in terms of $\beta$. For example, if the model could be specified as

\[
\Theta_X = \{\xi \in R^s : \xi = X\beta\},
\]

then $\xi(\beta) = X\beta$. Notice that the multinomial model, which includes the $K$
constraints $e^{\xi'}(\oplus_1^K 1_R) = n'$, is not directly amenable to this approach. In fact,
we would have to reparameterize to a smaller set $\xi^*$ of $s - K$ model parameters
that account for the $K$ constraints. This reparameterization results in an
asymmetric treatment of the $\xi$ and for that reason is deemed undesirable.
On the other hand, the Poisson model considered below will often lend itself
to this maximization approach, since the $K$ constraints $e^{\xi'}(\oplus_1^K 1_R) = n'$ are
not included.

Computationally, the method of maximizing the log likelihood with
respect to the freedom parameters is usually simple. Assuming the log
likelihood is concave and differentiable in $\beta$, we need only solve for the root
of the 'score equations', viz.

\[
s(\beta; y) = \frac{\partial \ell(\xi(\beta); y)}{\partial \beta} = 0 .
\]

Many of the asymptotic properties of the maximum likelihood estimator
$\hat{\beta}$ for $\beta$ are derived by formally expanding the score vector $s(\beta; y)$ about the
true value $\beta = \beta^*$ in a linear Taylor expansion. That is,

\[
s(\beta; y) = s(\beta^*; y) + \frac{\partial s(\beta^*; y)}{\partial \beta'}(\beta - \beta^*) + O(\|\beta - \beta^*\|^2) \qquad (2.3.6)
\]

In particular, in many situations,

\[
0 = s(\hat{\beta}; Y) = s(\beta^*; Y) + \frac{\partial s(\beta^*; Y)}{\partial \beta'}(\hat{\beta} - \beta^*) + O_p(1),
\]

so that $\hat{\beta} - \beta^*$ has the same asymptotic distribution as

\[
\left( -\frac{\partial s(\beta^*; Y)}{\partial \beta'} \right)^{-1} s(\beta^*; Y) .
\]

Subsequently, we will derive the asymptotic distribution of $\hat{\beta} - \beta^*$ in a different
way. This alternative derivation of the asymptotic distribution of the freedom
parameter estimate will shed new light on the relationship between the
asymptotic behavior of the estimates under the two sampling assumptions:
product Poisson and product multinomial.

Expression (2.3.6) also gives some indication of how one might numerically
solve for $\hat{\beta}$, the root of the score equation. A Newton-Raphson type
algorithm is often used. This root finding algorithm involves the inversion
of the derivative matrix $\partial s(\beta; y)/\partial \beta'$, which is usually of small dimension
since the model is usually specified in terms of a small number of freedom
parameters. In fact, the dimension of the derivative matrix will not be larger
than $s \times s$, which occurs when the model is saturated.
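A Newton-Raphson iteration of this kind can be sketched for the Poisson loglinear model $\xi = X\beta$, one of the models named above as amenable to the freedom parameter approach. The code is illustrative and assumes NumPy. The function name, starting value, stopping rule, and toy data are hypothetical choices. For independent Poisson counts with log means $X\beta$, the score is $X'(y - e^{X\beta})$ and its derivative is $-X'D(e^{X\beta})X$, both standard results.

```python
import numpy as np

def newton_raphson_poisson(y, X, max_iter=50, tol=1e-10):
    """Solve the score equations X'(y - exp(X beta)) = 0 by Newton-Raphson."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    # Starting value: least squares fit to the (slightly shifted) log counts.
    beta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # s(beta; y)
        deriv = -(X.T * mu) @ X           # ds(beta; y)/dbeta', a p x p matrix
        step = np.linalg.solve(deriv, score)
        beta = beta - step                # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

y = np.array([2.0, 3.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 15.0])
X = np.column_stack([np.ones(9), np.arange(9.0)])
beta_hat = newton_raphson_poisson(y, X)
score_at_root = X.T @ (y - np.exp(X @ beta_hat))

# Intercept-only model: the root is the log of the sample mean.
beta_null = newton_raphson_poisson(y, np.ones((9, 1)))
```

Only a $p \times p$ system is solved at each step (here $2 \times 2$), which is the computational advantage the text attributes to this approach.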

Constraint Equations Approach. In many situations, it may be difficult to
specify a model in terms of only freedom parameters, or perhaps it is possible
but the researcher would like to treat the model parameters symmetrically,
which would necessitate an additional constraint equation. It also could be
that the function $C \log A e^{\xi}$ is not a 1:1 function of $\xi$ so that for given $\beta$, we
cannot solve for $\xi$ explicitly. In any of these cases, we may not be able to
use the aforementioned maximization approach.

In this section, we consider an alternative method for finding that $\xi$
that maximizes the function $\ell^{(M)}(\xi; y)$ subject to $\xi \in \Theta_h$. The method we
will use is Lagrange's method of undetermined multipliers. Aitchison and
Silvey (1958, 1960) and Silvey (1959) provide much of the essential underlying
theory related to this approach. Three positive features of this method
are that (i) estimation of both $\xi$ and $\beta$ is possible, (ii) the method provides
us with another enlightening way of deriving the asymptotic distribution
of the freedom parameter estimators, and (iii) the method works quite
generally. A negative feature of this approach is the computational difficulty.
Computationally, the method becomes burdensome as $s$, the number of log
mean parameters, and $u + l + K$, the number of constraints implied by the
model, become large. In fact, the algorithm involves the inversion of an
$(s + u + l) \times (s + u + l)$ matrix. One positive note is that this potentially very
large matrix does have a simple form and one can invoke some simple matrix
algebra results to reduce the inversion problem to one of inverting matrices
of dimensions $(u + l) \times (u + l)$ and $s \times s$.

To best illustrate the difference in computational difficulty of the two
methods, we consider the following normal linear model example. Let

$$Y_i \sim \text{ind } N(\mu_i = \beta_0 + \beta_1 x_i,\ \sigma^2), \quad i = 1, 2, \ldots, 100, \quad \sigma^2 \text{ known}.$$

The log likelihood can easily be written as a function of $\beta = (\beta_0, \beta_1)'$.
Maximizing this likelihood with respect to $\beta$ involves working with a $2 \times 2$
matrix. On the other hand, we could equivalently specify the linear model in
terms of the 98 constraints,

$$\frac{\mu_{i+1} - \mu_i}{x_{i+1} - x_i} = \frac{\mu_{i+2} - \mu_{i+1}}{x_{i+2} - x_{i+1}}, \quad i = 1, 2, \ldots, 98,$$

and use Lagrange's method. In this case, we would need to invert a matrix
which has dimension $(s + u + l) \times (s + u + l) = 198 \times 198$.

Even when we use the matrix algebra results that simplify the problem
of working with the $198 \times 198$ matrix, we still are left with a formidable task.
It seems that when $s$ is large and the model is parsimonious, i.e. $u + l + K$,
the number of constraints, is large, the undetermined multiplier method may
not be the method of choice. However, in time, as computer efficiency gains
are realized, we predict that the scope of candidate models to be fit using
this method will increase tremendously. In fact, at present, many categorical
models can easily be fit using Lagrange's method. We discuss in more detail
how we can use the method of undetermined multipliers to fit models like
$[\Theta_h]$ of (2.3.5).
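
The dimension count in the normal linear model example can be checked with a small sketch. The design points $x_i = i$ and the coefficients used here are illustrative assumptions, not values from the text; the sketch only shows that a mean vector that is linear in $x$ satisfies all 98 slope constraints, and that the constraint formulation works with a 198-dimensional system where the freedom formulation works with 2 parameters.

```python
# Hypothetical inputs: x_i = i and (beta0, beta1) = (2.0, 0.5) are illustrative.
x = [float(i) for i in range(1, 101)]
beta0, beta1 = 2.0, 0.5
mu = [beta0 + beta1 * xi for xi in x]


def slope_constraints(mu, x):
    """Differences between successive slopes; all zero when mu is linear in x."""
    slopes = [(mu[i + 1] - mu[i]) / (x[i + 1] - x[i]) for i in range(len(mu) - 1)]
    return [slopes[i] - slopes[i + 1] for i in range(len(slopes) - 1)]


constraints = slope_constraints(mu, x)
s = len(mu)                        # number of mean parameters
n_constraints = len(constraints)   # plays the role of u + l
lagrange_dim = s + n_constraints   # dimension of the matrix in Lagrange's method
freedom_dim = 2                    # dimension when maximizing over (beta0, beta1)

# A mean vector that is not linear in x violates at least one constraint.
bent = list(mu)
bent[50] += 1.0
violated = max(abs(c) for c in slope_constraints(bent, x))
```

Under these assumptions `lagrange_dim` is 198 while `freedom_dim` is 2, which is the contrast the example is meant to convey.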

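We are to maximize the function $\ell^{(M)}(\xi; y) = y'\xi$ subject to the constraint
$\xi \in \Theta_h$, where

$$\Theta_h = \{\xi \in R^s : U'C\log(Ae^{\xi}) = 0,\ Le^{\xi} = 0,\ e^{\xi'}(\oplus_1^K 1_R) - n' = 0\}$$
$$\phantom{\Theta_h} = \{\xi \in R^s : h(\xi) = 0,\ e^{\xi'}(\oplus_1^K 1_R) = n'\},$$

and $h'(\xi) = [\log(e^{\xi'}A')C'U,\ e^{\xi'}L']$.

Consider the Lagrangian objective function

$$F(\gamma) = \ell^{(M)}(\xi; y) + (e^{\xi'}(\oplus_1^K 1_R) - n')\tau + h'(\xi)\lambda,$$

where $\gamma = \text{vec}(\xi, \tau, \lambda)$. The $K \times 1$ vector $\tau$ and the $(u + l) \times 1$ vector $\lambda$ are
called either 'Lagrange multipliers' or 'undetermined multipliers'.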

Provided a maximum $\hat\xi$ exists and that the Jacobian of
$[e^{\xi'}(\oplus_1^K 1_R) - n',\ h'(\xi)]$ is of full row rank $u + l + K$ for all $\xi \in \Theta_h$, we can solve for the
maximum by solving the system of equations

$$\begin{pmatrix} y + D(e^{\hat\xi^{(M)}})(\oplus_1^K 1_R)\hat\tau^{(M)} + H(\hat\xi^{(M)})\hat\lambda^{(M)} \\ (\oplus_1^K 1_R')e^{\hat\xi^{(M)}} - n \\ h(\hat\xi^{(M)}) \end{pmatrix} = 0 \qquad (2.3.7)$$

where the matrix $H(\xi) = \partial h'(\xi)/\partial\xi$. The Jacobian condition basically
requires the constraints to be nonredundant, thereby making $[\Theta_h]$ a well-defined
model.

From this point on, for notational convenience, the indices for the direct
sum $\oplus$ will be omitted unless they are different from 1 and $K$.

We now require the matrices of models $[\Theta_X]$ and $[\Theta_h]$ to satisfy some
additional conditions. Let us assume that

(A2)     Either $C_i = I_{q_iK}$ or $C_i(\oplus 1_{m_i}) = 0$,    $i = 1, 2$
and

(A3)     If $C_i = I_{q_iK}$ then $\mathcal{M}(X_i) \supset \mathcal{M}(\oplus 1_{m_i})$.


The assumptions require $C_i$ to be either a contrast matrix (rows sum
to zero), a zero matrix, or the identity matrix. If $C_i$ is the identity matrix,
it will be required that there exists a set of columns in $X_i$ that spans a
space containing the range space of $\oplus_1^K 1_{m_i}$. For most models of interest
these conditions are met. For example, any of the logit type models, such as
cumulative or multiple logit models, can be specified with $C$ being a contrast
matrix. For loglinear models, the condition (A3) is met whenever the model
includes a parameter for each of the $K$ multinomials.

The following lemma will be useful in showing that the maximum
likelihood estimates of $\xi$ and $\beta$ are equivalent under both sampling schemes,
product-Poisson and product-multinomial. The lemma will also enable us to
reduce the number of equations in (2.3.7) that must be simultaneously solved
when computing the maximum (multinomial) likelihood estimators.

Lemma 2.3.1.    If the matrices of models $[\Theta_X]$ and $[\Theta_h]$ satisfy (A1), (A2),
and (A3), then provided the model holds

$$(\oplus 1_R')H(\xi) = 0.$$

Proof.    Using matrix derivatives (MacRae, 1974; Magnus and Neudecker,
1988), it follows that

$$H(\xi) = [D(e^{\xi})A'D^{-1}(Ae^{\xi})C'U,\ D(e^{\xi})L'].$$

Thus,

$$\begin{aligned}
(\oplus 1_R')H(\xi) &= [(\oplus e^{\xi_i'})A'D^{-1}(Ae^{\xi})C'U,\ (\oplus e^{\xi_i'})L'] \\
&= \left[(\oplus e^{\xi_i'})[A_1', A_2']\,D^{-1}\begin{pmatrix} A_1e^{\xi} \\ A_2e^{\xi} \end{pmatrix}(C_1' \oplus C_2')U,\ (\oplus e^{\xi_i'})L'\right] \\
&= \left[[(\oplus e^{\xi_i'})A_1'D^{-1}(A_1e^{\xi}),\ (\oplus e^{\xi_i'})A_2'D^{-1}(A_2e^{\xi})](C_1' \oplus C_2')U,\ 0\right] \\
&= \left[[(\oplus e^{\xi_i'}A_{1i}')D^{-1}(A_1e^{\xi})C_1',\ (\oplus e^{\xi_i'}A_{2i}')D^{-1}(A_2e^{\xi})C_2']U,\ 0\right] \\
&= \left[[(\oplus 1_{m_1}')C_1',\ (\oplus 1_{m_2}')C_2']U,\ 0\right] \\
&= [0,\ 0].
\end{aligned}$$

The third equality follows since the model holding implies that $(\oplus e^{\xi_i'})L' = 0$.
The sixth equality can be seen via the following argument.

If both $C_i$'s are contrast matrices, or zero matrices, then (A2) implies
that the matrix $[(\oplus 1_{m_1}')C_1',\ (\oplus 1_{m_2}')C_2']$ is the zero matrix. On the other hand,
if both $C_1$ and $C_2$ are identity matrices, then the columns of $U$ span
the null space of $X'$, which, by (A3), implies that the columns of $U$ span a
set contained in the null space of $[(\oplus 1_{m_1}'),\ (\oplus 1_{m_2}')]$, so that
we have $[(\oplus 1_{m_1}'),\ (\oplus 1_{m_2}')]U = 0$. Any other combination of $C_1$ and $C_2$
can also be seen to result in the matrix equaling zero.  $\Box$

The following theorem gives conditions under which we can find the ML
estimators of $\xi$ by solving a reduced set of equations. The smaller system of
equations no longer includes the identifiability constraint equations.

Theorem 2.3.1  Let $\text{vec}(\hat\xi^{(M)}, \hat\tau^{(M)}, \hat\lambda^{(M)})$ be the solution to (2.3.7).
Assuming that (A1), (A2), and (A3) hold, the sub-vector $\text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$
is the solution to the reduced set of $s + u + l$ equations

$$\begin{pmatrix} y - e^{\hat\xi^{(M)}} + H(\hat\xi^{(M)})\hat\lambda^{(M)} \\ h(\hat\xi^{(M)}) \end{pmatrix} = 0. \qquad (2.3.8)$$

Proof:   Premultiplying the first set of equations in (2.3.7) by $\oplus 1_R'$, we arrive
at

$$(\oplus 1_R')y + (\oplus 1_R')D(e^{\hat\xi^{(M)}})(\oplus 1_R)\hat\tau^{(M)} + (\oplus 1_R')H(\hat\xi^{(M)})\hat\lambda^{(M)} = 0. \qquad (2.3.9)$$

Now, $(\oplus 1_R')y = n$ and $(\oplus 1_R')D(e^{\hat\xi^{(M)}}) = \oplus e^{\hat\xi_i^{(M)\prime}}$. Also, since $\hat\xi^{(M)} \in \Theta_h$ it must
be that $(\oplus e^{\hat\xi_i^{(M)\prime}})(\oplus 1_R) = D(n)$, the diagonal matrix with the multinomial
indices on the diagonal. Further, by Lemma 2.3.1,
$(\oplus 1_R')H(\hat\xi^{(M)}) = 0$. Therefore, (2.3.9) can be rewritten as

$$n + D(n)\hat\tau^{(M)} = 0,$$

which implies that $\hat\tau^{(M)} = -1_K$. Now, since the identifiability constraints have
been explicitly accounted for when solving for $\hat\tau^{(M)}$, we can replace $\hat\tau^{(M)}$ of
(2.3.7) by $-1_K$ and omit the identifiability constraints. Thus, $\text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$
is the solution to the reduced set of equations

$$\begin{pmatrix} y - e^{\hat\xi^{(M)}} + H(\hat\xi^{(M)})\hat\lambda^{(M)} \\ h(\hat\xi^{(M)}) \end{pmatrix} = 0.$$

This is what we set out to show.  $\Box$

Before detailing the iterative scheme used for solving (2.3.8), we will
explore the asymptotic behavior of the estimator $\hat\theta^{(M)} = \text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$
within the framework of constraint models.

2.3.3     Asymptotic  Distribution  of  Product-Multinomial  ML  Estimators 

In what follows, we will assume that $K$, the number of identifiability
constraints, is some fixed integer, $K \ge 1$. We also will assume that the
asymptotics hold as $n_* = \min\{n_i\}$ approaches infinity and that $n_* \sim n_i$, $i =
1, \ldots, K$. That is, we assume that the asymptotic approximations hold as
each of the multinomial indices gets large at the same rate.

The derivation of the asymptotic distribution of $\hat\theta^{(M)}$ will follow closely
that of Aitchison and Silvey (1958). Briefly, Aitchison and Silvey show that
if the score vector is $o_p(n)$ and the constraints are such that the derivative
matrices $H(\xi)$ and $\partial H'(\xi)/\partial\xi$ have elements that are bounded functions then,
provided certain mild regularity conditions hold, the maximum likelihood
estimator $\hat\xi$ is an $n^{-1/2}$-consistent estimator of $\xi_0$ and $\hat\lambda$ is an $n^{1/2}$-consistent
estimator of 0. They show that the joint distribution of $(n^{1/2}(\hat\xi - \xi_0),\ n^{-1/2}\hat\lambda)$
is multivariate normal with zero mean and covariance matrix

$$\begin{pmatrix} B^{-1} - B^{-1}H(H'B^{-1}H)^{-1}H'B^{-1} & 0 \\ 0 & (H'B^{-1}H)^{-1} \end{pmatrix}$$

where $B$ is the information matrix and $H$ is the derivative of the constraint
function.

In our application, however, there are some minor changes. With the
parameterization we use, the information matrix is zero since the (multinomial)
log likelihood (2.3.2) is linear in the parameter $\xi$. This happens because the
identifiability constraints $e^{\xi'}(\oplus_1^K 1_R) = n'$ are ignored, to preserve symmetry,
when differentiating. Also, in our parameterization, the constraints are in
terms of $e^{\xi}$, the components of which are $e^{\xi_{ij}} = n_i\pi_{ij}$. Thus, the constraints
and the corresponding derivative matrices may not be bounded. For example,
a typical constraint is of the form $Le^{\xi} = 0$. It follows that the components
of $Le^{\xi}$ and the derivatives are increasing without bound as the multinomial
indices are allowed to increase without bound.

Fortunately, we can still use the results of Aitchison and Silvey (1958)
by replacing the matrix $H$ and the vector $\lambda/n$ of Aitchison and Silvey by
our $H/n_*$ and $\lambda$, where $n_* = \min\{n_i\}$. The zero information problem can be
solved by identifying the vector $Y - e^{\xi}$ as the 'score vector'. It is pointed out
that, in this case, the asymptotic variance of $\oplus D^{-1/2}(n_i 1_R)$ times the score
vector is not equal to the negative derivative matrix $D(\pi_0)$ but instead is
equal to $D(\pi_0) - \oplus\pi_{0i}\pi_{0i}'$. This happens because the components of $Y$ are not
independent; $Y$ is product multinomial. Using this reparameterization, all of
the necessary assumptions required by Aitchison and Silvey (1958) hold, i.e.
assumptions $\mathcal{X}$ and $\mathcal{H}$ of Aitchison and Silvey (1958) hold.

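As previously mentioned, Aitchison and Silvey show that $\hat\lambda$ is an
$n^{1/2}$-consistent estimator of 0. With our parameterization, having replaced
$\lambda/n$ by $\lambda$, it follows that $\hat\lambda^{(M)}$ will be $n_*^{-1/2}$-consistent. We now derive the
asymptotic distribution of $\hat\theta^{(M)}$.

Define the stochastic function $g$ by

$$g(\theta; Y) = \begin{pmatrix} Y - e^{\xi} + H(\xi)\lambda \\ h(\xi) \end{pmatrix}.$$

The maximum likelihood estimator $\hat\theta^{(M)}$ is the solution to $g(\theta; Y) = 0$.

Under our parameterization, using the results of Aitchison and Silvey
(1958), we have that each of the following hold

$$e^{\hat\xi^{(M)}} - e^{\xi_0} = D(e^{\xi_0})(\hat\xi^{(M)} - \xi_0) + O_p(1),$$

$$h(\hat\xi^{(M)}) = h(\xi_0) + H'(\xi_0)(\hat\xi^{(M)} - \xi_0) + O_p(1) = H'(\xi_0)(\hat\xi^{(M)} - \xi_0) + O_p(1),$$

and

$$H(\hat\xi^{(M)})\hat\lambda^{(M)} = H(\xi_0)\hat\lambda^{(M)} + O_p(1).$$

Thus, the equation $g(\hat\theta^{(M)}; Y) = 0$ can be rewritten as

$$\begin{aligned}
0 &= \begin{pmatrix} Y - e^{\xi_0} - D(e^{\xi_0})(\hat\xi^{(M)} - \xi_0) + H(\xi_0)\hat\lambda^{(M)} + O_p(1) \\ H'(\xi_0)(\hat\xi^{(M)} - \xi_0) + O_p(1) \end{pmatrix} \\
&= \begin{pmatrix} Y - e^{\xi_0} \\ 0 \end{pmatrix} - \begin{pmatrix} D(e^{\xi_0}) & -H(\xi_0) \\ -H'(\xi_0) & 0 \end{pmatrix}\begin{pmatrix} \hat\xi^{(M)} - \xi_0 \\ \hat\lambda^{(M)} \end{pmatrix} + O_p(1).
\end{aligned}$$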
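Therefore, it follows that

$$\begin{pmatrix} (\oplus D^{-1/2}(n_i 1_R))(Y - e^{\xi_0}) \\ 0 \end{pmatrix} = \begin{pmatrix} D(\pi_0) & -H/n_* \\ -H'/n_* & 0 \end{pmatrix} n_*^{1/2}\begin{pmatrix} \hat\xi^{(M)} - \xi_0 \\ \hat\lambda^{(M)} \end{pmatrix} + o_p(1) \qquad (2.3.10)$$

since $n_* \sim n_i$, $i = 1, \ldots, K$ and $\pi_0 = (\oplus D^{-1}(n_i 1_R))e^{\xi_0}$.

Now, the random variable $(\oplus D^{-1/2}(n_i 1_R))(Y - e^{\xi_0})$ is a vector of normalized
sample proportions so that

$$\begin{pmatrix} (\oplus D^{-1/2}(n_i 1_R))(Y - e^{\xi_0}) \\ 0 \end{pmatrix}$$

has an asymptotic normal distribution with zero mean and covariance matrix

$$\begin{pmatrix} D(\pi_0) - \oplus\pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}.$$

Therefore, by an extension of a theorem of Cramér (1949) and by equation
(2.3.10), it follows that $n_*^{1/2}(\hat\theta^{(M)} - \theta_0) = n_*^{1/2}\text{vec}(\hat\xi^{(M)} - \xi_0, \hat\lambda^{(M)})$ has an
asymptotic normal distribution with mean zero and covariance

$$\begin{pmatrix} D(\pi_0) & -H/n_* \\ -H'/n_* & 0 \end{pmatrix}^{-1}\begin{pmatrix} D(\pi_0) - \oplus\pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} D(\pi_0) & -H/n_* \\ -H'/n_* & 0 \end{pmatrix}^{-1}.$$

This covariance matrix is shown in the appendix to have the simple form

$$\begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix}$$

where

$$M_1 = D^{-1}(\pi_0) - D^{-1}(\pi_0)H(H'D^{-1}(\pi_0)H)^{-1}H'D^{-1}(\pi_0) - \oplus 1_R1_R'$$

and

$$M_2 = n_*^2(H'D^{-1}(\pi_0)H)^{-1}.$$

Finally, using the fact that $n_* \sim n_i$, $i = 1, \ldots, K$, we can selectively
replace $n_*$ by the appropriate $n_i$ to arrive at a simple, asymptotically
equivalent, expression for the asymptotic covariance of $\hat\theta^{(M)} = \text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$.
It is

$$\begin{pmatrix} D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} - \oplus\dfrac{1_R1_R'}{n_i} & 0 \\ 0 & (H'D^{-1}H)^{-1} \end{pmatrix}, \qquad (2.3.12)$$

where $D = D(\mu_0) = D(e^{\xi_0})$ and $H = H(\xi_0)$.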

2.3.4     Lagrange's  Method — The  Algorithm 

In  this  section,  we  give  details  of  how  one  can  actually  fit  the  models 
of  (2.3.4)  or  equivalently  (2.3.5).  We  show  how  Lagrange's  undetermined 
multipliers  method  can  be  used  in  conjunction  with  a  modified  Newton- 
Raphson  iterative  scheme  to  compute  the  ML  estimators  and  their  asymptotic 
covariances.  We  will  assume  that  the  model  assumptions  (Al),  (A2),  and  (A3) 
hold. This section includes an outline of the algorithm used in the FORTRAN
program 'mle.restraint'.

Recall that our objective is to find that $\hat\xi^{(M)} \in \Theta_X$, where

$$\Theta_X = \{\xi \in R^s : C\log(Ae^{\xi}) = X\beta,\ Le^{\xi} = 0,\ (\oplus 1_R')e^{\xi} = n\},$$

that maximizes the multinomial log likelihood $\ell^{(M)}(\xi; y) = y'\xi$.

Since the assumptions (A1), (A2), and (A3) hold, we see by Theorem
2.3.1 that our problem is reduced to one of solving the system of equations
(2.3.8), i.e. to find the ML estimator $\hat\theta^{(M)} = \text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$ we must
simultaneously solve the system of $s + u + l$ equations

$$g(\theta) = \begin{pmatrix} y - e^{\xi} + H(\xi)\lambda \\ h(\xi) \end{pmatrix} = 0,$$

where the $(u + l) \times 1$ vector $h$ and the $s \times (u + l)$ matrix $H$ are defined as
follows:

$$h(\xi) = \begin{pmatrix} U'C\log(Ae^{\xi}) \\ Le^{\xi} \end{pmatrix}$$

and

$$H(\xi) = \frac{\partial h'(\xi)}{\partial\xi}.$$

It will be shown in section 2.4 that $g(\theta)$ is actually the derivative
of the Lagrangian objective function under the product-Poisson sampling
assumption.

The iterative scheme used in the FORTRAN program 'mle.restraint' is
a modified Newton-Raphson algorithm. The algorithm can be sketched as
follows.

(1)  Find a starting value $\theta^{(0)}$ for $\theta$.

(2)  Replace $\theta^{(\nu)}$ by $\theta^{(\nu+1)} = \theta^{(\nu)} - G^{-1}(\theta^{(\nu)})g(\theta^{(\nu)})$.  (2.3.13)

(3)  If $\|g(\theta^{(\nu+1)})\| >$ tol go to (2). Else stop.


The matrix $G(\theta)$ used in step (2) is actually

$$G(\theta) = \begin{pmatrix} -D(e^{\xi}) & H(\xi) \\ H'(\xi) & 0 \end{pmatrix}$$

and the inverse of $G(\theta)$ is of the very simple form (see Aitchison and Silvey,
1958 or Rao, 1974)

$$G^{-1}(\theta) = -\begin{pmatrix} D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & -D^{-1}H(H'D^{-1}H)^{-1} \\ -(H'D^{-1}H)^{-1}H'D^{-1} & -(H'D^{-1}H)^{-1} \end{pmatrix}, \qquad (2.3.14)$$

where $D = D(e^{\xi})$. Since we use $G(\theta)$ in place of the Hessian matrix, the
procedure is a modification to the Newton-Raphson method. Haber (1985a)
used the more complicated Hessian matrix.

Notice that the inversion of $G$, which may be performed at each iteration,
is not nearly as difficult as inverting a general matrix of dimension
$(s + u + l) \times (s + u + l)$. First of all, in view of (2.3.14), to obtain the inverse of the
partitioned matrix $G$, we need only invert the matrices $D$ and $H'D^{-1}H$, which
are of dimension $s \times s$ and $(u + l) \times (u + l)$. Secondly, the inversion of $D$ is
simple since $D$ is a diagonal matrix with $e^{\xi}$ on the diagonal. Hence, the most
formidable task in the inversion process is the inversion of the symmetric
positive definite matrix $H'D^{-1}H$. There are many efficient ways to invert
large symmetric positive definite matrices.
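
The iteration (2.3.13), with the inverse taken from (2.3.14), can be sketched for a toy problem. What follows is not the 'mle.restraint' program; it is a minimal illustration under stated assumptions: a single multinomial ($K = 1$) on a $2 \times 2$ table with hypothetical counts, $A$ and $C$ taken to be identity matrices, no $L$ constraints, and one constraint $U'\xi = 0$ with $U = (1, -1, -1, 1)'$ (independence). In this case $H(\xi) = U$ does not depend on $\xi$, so $H'D^{-1}H$ is a scalar and $G$ coincides with the exact derivative matrix. The function name `fit_constrained` is an illustrative choice.

```python
import math


def fit_constrained(y, U, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - G^{-1} g for one linear constraint U'xi = 0."""
    xi = [math.log(v) for v in y]   # starting value: observed log counts
    lam = 0.0
    for _ in range(max_iter):
        mu = [math.exp(v) for v in xi]
        h = sum(u * x for u, x in zip(U, xi))                       # h(xi)
        g1 = [yj - mj + uj * lam for yj, mj, uj in zip(y, mu, U)]   # y - e^xi + H lam
        if math.sqrt(sum(v * v for v in g1) + h * h) <= tol:
            break
        a = [g / m for g, m in zip(g1, mu)]        # D^{-1} g1
        w = [u / m for u, m in zip(U, mu)]         # D^{-1} H
        c = sum(u * wj for u, wj in zip(U, w))     # H' D^{-1} H (a scalar here)
        t = (sum(u * aj for u, aj in zip(U, a)) + h) / c
        # Block form of -G^{-1} g written out for the xi and lambda parts.
        xi = [x + aj - wj * t for x, aj, wj in zip(xi, a, w)]
        lam -= t
    return [math.exp(v) for v in xi], lam


# Hypothetical 2x2 table, cells ordered (1,1), (1,2), (2,1), (2,2).
mu_hat, lam_hat = fit_constrained([10.0, 20.0, 30.0, 40.0], [1.0, -1.0, -1.0, 1.0])
```

For these counts the fitted means are the usual independence fit (row total times column total over $n$), and they sum to $n = 100$ even though the identifiability constraint is never imposed in the iteration, in line with Theorem 2.3.1.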

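Upon convergence of the algorithm (2.3.13), estimates of the asymptotic
covariances of $\hat\xi^{(M)}$ and $\hat\lambda^{(M)}$ are readily calculable. Write $G^{-1}(\theta)$ of (2.3.14)
as

$$G^{-1}(\theta) = -\begin{pmatrix} P & Q \\ Q' & R \end{pmatrix},$$

where

$$P = D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1},$$

$$Q = -D^{-1}H(H'D^{-1}H)^{-1},$$

and $R = -(H'D^{-1}H)^{-1}$.

By (2.3.12), the asymptotic covariance of $\hat\theta^{(M)} = \text{vec}(\hat\xi^{(M)}, \hat\lambda^{(M)})$ can be
estimated by

$$\text{var}(\hat\theta^{(M)}) = \begin{pmatrix} P - \oplus\dfrac{1_R1_R'}{n_i} & 0 \\ 0 & -R \end{pmatrix}.$$

Variance estimates for other continuous functions of $\hat\theta^{(M)}$, such as
$\hat\mu^{(M)} = e^{\hat\xi^{(M)}}$ and $\hat\beta^{(M)} = (X'X)^{-1}X'C\log(Ae^{\hat\xi^{(M)}})$, can be found by invoking
the delta method. For example,

$$\text{var}(\hat\mu^{(M)}) = D(e^{\hat\xi^{(M)}})\,\text{var}(\hat\xi^{(M)})\,D(e^{\hat\xi^{(M)}})$$

and

$$\text{var}(\hat\beta^{(M)}) = (X'X)^{-1}X'C\,D^{-1}(A\hat\mu^{(M)})\,A\,\text{var}(\hat\mu^{(M)})\,A'\,D^{-1}(A\hat\mu^{(M)})\,C'X(X'X)^{-1}.$$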
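
As a small numerical illustration of these variance formulas, the sketch below evaluates the estimated covariance of $\hat\xi^{(M)}$ and applies the delta method for $\hat\mu^{(M)}$. The inputs are assumptions chosen for illustration: a single multinomial ($K = 1$) with fitted means $(12, 18, 28, 42)$ and a single constraint with $H = U = (1, -1, -1, 1)'$, so that $H'D^{-1}H$ is a scalar.

```python
# Illustrative inputs (not from the text): fitted means and one constraint vector.
mu = [12.0, 18.0, 28.0, 42.0]
U = [1.0, -1.0, -1.0, 1.0]       # H = U for the linear constraint U'xi = 0
n = sum(mu)                      # multinomial index, K = 1
s = len(mu)

c = sum(u * u / m for u, m in zip(U, mu))    # H' D^{-1} H with D = D(mu)

# P = D^{-1} - D^{-1} H (H' D^{-1} H)^{-1} H' D^{-1}
P = [[(1.0 / mu[i] if i == j else 0.0) - (U[i] / mu[i]) * (U[j] / mu[j]) / c
      for j in range(s)] for i in range(s)]

# Multinomial sampling with K = 1: subtract 1 1'/n from P.
var_xi = [[P[i][j] - 1.0 / n for j in range(s)] for i in range(s)]
var_lam = 1.0 / c                            # (H' D^{-1} H)^{-1}

# Delta method: var(mu_hat) = D(mu) var(xi_hat) D(mu).
var_mu = [[mu[i] * var_xi[i][j] * mu[j] for j in range(s)] for i in range(s)]

# Variance of the estimated total count 1'mu_hat.
total_var = sum(sum(row) for row in var_mu)
```

Here `total_var` is zero up to rounding, which is what one expects under multinomial sampling because the fitted total is fixed at $n$.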

Evidently,  Lagrange's  method  of  undetermined  multipliers  provides  us 
with  a  convenient  procedure  for  maximum  likelihood  fitting  of  models  in  a 
very  general  class  of  parametric  models  for  multivariate  polytomous  data  with 
covariates  possible.  We  now  briefly  outline  the  steps  needed  to  perform  the 
iterations  of  (2.3.13). 

Computing  U.  The  first  thing  we  must  do  is  write  the  freedom  model  (2.3.4), 
which  can  easily  be  input  by  the  user,  as  a  constraint  model  (2.3.5).  Therefore, 
we  must  compute  a  full  column  rank  matrix  U  that  satisfies  U'X  =  0.  The 
method  we  use  to  find  U  is  attributed  to  Haber  (1985b). 

Using the notation of 'mle.restraint', let $X$ be a full column rank matrix
of dimension $q \times r$. Let $u = q - r$ be the dimension of the null space of $X'$.
Further the matrices $A$ and $C$ of (2.3.4) will have dimensions $m \times s$ and $q \times m$
respectively. The relationship between these dimension variables and those
used in sections 2.3.1 and 2.3.2 is as follows

$$q = K(q_1 + q_2), \qquad r = p_1 + p_2, \qquad m = K(m_1 + m_2).$$

We use the variables $q$, $r$, and $m$ for notational convenience.

Consider the matrix $U^* = I_q - X(X'X)^{-1}X'$. This $q \times q$ matrix is of rank
$u = q - r$ and satisfies the property

$$U^{*\prime}X = 0.$$

Let $W$ denote a $q \times u$ matrix with random elements. Specifically,

$$W_{ij} \sim \text{Uniform}(0, 100), \quad i = 1, \ldots, q, \quad j = 1, \ldots, u.$$

It follows that the matrix $W$ is of full column rank with probability one and
hence that the $q \times u$ matrix $U = U^*W$ is of full column rank $u$ with probability
one. But the matrix $U$ satisfies

$$U'X = W'U^{*\prime}X = W'0 = 0.$$

Therefore, at least with probability one, we have found a full column rank
matrix $U$ that satisfies the property $U'X = 0$. Using this $U$, we are able to
write freedom model (2.3.4) as a constraint model (2.3.5).
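
The construction of $U$ just described can be sketched in a few lines. This is an illustration only, not the code of 'mle.restraint': the $4 \times 2$ design matrix, the helper names (`matmul`, `transpose`, `inverse`, `null_basis`) and the fixed random seed are assumptions made so the example is small and reproducible.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for a small X'X)."""
    n = len(M)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def null_basis(X, seed=0):
    """Return the q x u matrix U = U* W, where U* = I - X (X'X)^{-1} X'."""
    q, r = len(X), len(X[0])
    u = q - r
    Xt = transpose(X)
    proj = matmul(matmul(X, inverse(matmul(Xt, X))), Xt)
    Ustar = [[float(i == j) - proj[i][j] for j in range(q)] for i in range(q)]
    rng = random.Random(seed)
    W = [[rng.uniform(0, 100) for _ in range(u)] for _ in range(q)]
    return matmul(Ustar, W)


# Hypothetical design matrix: intercept plus a linear score, q = 4, r = 2.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
U = null_basis(X)
UtX = matmul(transpose(U), X)       # should be numerically zero
gram = matmul(transpose(U), U)      # nonsingular when U has full column rank
```

Because $U^*$ is symmetric, the product `UtX` is zero up to rounding, and a nonzero determinant of `gram` indicates that the random $W$ produced a full column rank $U$.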

Computing $h(\xi)$. We write the constraint model of (2.3.5) as

$$\{\xi \in R^s : h(\xi) = 0,\ e^{\xi'}(\oplus_1^K 1_R) = n'\}, \qquad (2.3.15)$$

where the constraint function $h$ is defined as

$$h(\xi) = \begin{pmatrix} U'C\log(Ae^{\xi}) \\ Le^{\xi} \end{pmatrix}.$$

Computing $g(\theta)$. Notice that since (A1), (A2), and (A3) hold, the
identifiability constraints present in the product multinomial model (2.3.4)
can be accounted for explicitly. It will follow by results of section 2.4, that
under either sampling scheme, product-Poisson or product-multinomial,
the maximum likelihood estimators for $\xi$ and $\lambda$ can be found by solving the
equation

$$g(\theta) = \begin{pmatrix} y - e^{\xi} + H(\xi)\lambda \\ h(\xi) \end{pmatrix} = 0, \qquad (2.3.16)$$

where the matrix $H$ is the derivative of $h'$ with respect to $\xi$.

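Computing $H(\xi)$. We will use matrix derivative results of MacRae (1974)
to find the matrix of derivatives of the constraint function $h'(\xi)$.

$$H(\xi) = \frac{\partial h'(\xi)}{\partial\xi} = \frac{\partial}{\partial\xi}[\log(e^{\xi'}A')C'U,\ e^{\xi'}L'] = [D(e^{\xi})A'D^{-1}(Ae^{\xi})C'U,\ D(e^{\xi})L'].$$

The equality follows upon using the matrix version of the chain rule. Notice
that

$$\frac{\partial}{\partial\xi}\left(\log(e^{\xi'}A')C'U\right) = \left(\frac{\partial e^{\xi'}}{\partial\xi}\right)\frac{\partial}{\partial e^{\xi}}\left(\log(e^{\xi'}A')C'U\right)$$

and that $\partial e^{\xi'}/\partial\xi = D(e^{\xi})$.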
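
A quick way to gain confidence in the closed form for $H(\xi)$ is to compare it with a finite-difference derivative of $h$. The sketch below does this for a deliberately small case; the matrices are assumptions for illustration: $R = 3$ cells, $A$ forming the sums $e^{\xi_1} + e^{\xi_2}$ and $e^{\xi_3}$, a single logit contrast $C'U = (1, -1)'$, and no $L$ constraints.

```python
import math

# Illustrative matrices (hypothetical): one logit-type constraint on three cells.
A = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]       # m x s
CU = [1.0, -1.0]            # C'U stored as a length-m vector


def h(xi):
    """h(xi) = U'C log(A e^xi) for the single constraint above."""
    ex = [math.exp(v) for v in xi]
    Ae = [sum(a * e for a, e in zip(row, ex)) for row in A]
    return sum(c * math.log(v) for c, v in zip(CU, Ae))


def H(xi):
    """Closed form D(e^xi) A' D^{-1}(A e^xi) C'U (an s-vector here)."""
    ex = [math.exp(v) for v in xi]
    Ae = [sum(a * e for a, e in zip(row, ex)) for row in A]
    return [ex[j] * sum(A[k][j] * CU[k] / Ae[k] for k in range(len(A)))
            for j in range(len(xi))]


def H_numeric(xi, eps=1e-6):
    """Central finite-difference approximation to dh/dxi."""
    out = []
    for j in range(len(xi)):
        up = list(xi)
        dn = list(xi)
        up[j] += eps
        dn[j] -= eps
        out.append((h(up) - h(dn)) / (2 * eps))
    return out


xi0 = [0.2, -0.5, 1.0]
max_diff = max(abs(a - b) for a, b in zip(H(xi0), H_numeric(xi0)))
column_sum = sum(H(xi0))    # 1'H(xi); zero because C is a contrast
```

The two derivatives agree to within the finite-difference error, and `column_sum` is zero up to rounding, which is the conclusion of Lemma 2.3.1 in this small case.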

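Computing $G(\theta)$. The iterative scheme (2.3.13) used to solve the system
of equations (2.3.16) is actually a slight modification of the Newton-Raphson
algorithm. It is a modification because we do not use the derivative matrix
$G^* = \partial g(\theta)/\partial\theta'$ to adjust at each iteration, as Haber (1985a) did, but rather a
simpler matrix $G$ that is related to $G^*$ by $G^* = G + O_p(n_*^{1/2})$. The derivative
matrix $G^*$ can be computed as follows.

$$G^* = \frac{\partial g(\theta)}{\partial\theta'} = \left[\frac{\partial g(\theta)}{\partial\xi'},\ \frac{\partial g(\theta)}{\partial\lambda'}\right] = \begin{pmatrix} -D(e^{\xi}) + \dfrac{\partial(H(\xi)\lambda)}{\partial\xi'} & H(\xi) \\ H'(\xi) & 0 \end{pmatrix}.$$

The matrix

$$\frac{\partial(H(\xi)\lambda)}{\partial\xi'}$$

is of order $O_p(n_*^{1/2})$ when it is evaluated at $\hat\theta = \text{vec}(\hat\xi, \hat\lambda)$ since
the elements of $\partial H(\xi)/\partial\xi$ are at most of order $n_*$

and

$$\hat\lambda = O_p(n_*^{-1/2}).$$

It follows that the matrix $G$, which is much simpler to invert than $G^*$, can
be used to adjust the estimate at each iteration.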

Computing the inverse of $G$. Although the matrix $G$ is of dimension
$(s + u + l) \times (s + u + l)$, which may be very large in practice, its inverse
is relatively simple to calculate. The inverse of the partitioned matrix

$$G = \begin{pmatrix} -D & H \\ H' & 0 \end{pmatrix}$$

is shown by Aitchison and Silvey (1958) to have form

$$G^{-1} = -\begin{pmatrix} D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & -D^{-1}H(H'D^{-1}H)^{-1} \\ -(H'D^{-1}H)^{-1}H'D^{-1} & -(H'D^{-1}H)^{-1} \end{pmatrix}.$$

Therefore, only the matrices $D$ and $H'D^{-1}H$, which are of dimensions
$s \times s$ and $(u + l) \times (u + l)$, need to be inverted. The inverse of $D$ is easily
calculated since $D$ is a diagonal matrix with $e^{\xi}$ on the diagonal. The inverse
of $H'D^{-1}H$, a symmetric positive definite matrix, can be found quite easily,
even when $u + l$, the number of constraints, is large. It should be pointed
out that when $s$, the total number of cell means, is large, the number of
constraints $u + l$ may be large and on the same order as $s$. This will be the
case for parsimonious models, that is, those models with many constraints relative
to the number of model parameters.

One could choose to invert the matrix $G$ a limited number of times to
mitigate the computational burden. In fact, in their 1958 and 1960 papers,
Aitchison and Silvey advocate an iterative method whereby the inverse of $G$
is computed only two times: once at the initial iteration and again at the
final iteration, upon convergence. We feel, however, that in this special case
in which the matrix $G$ has a particularly simple form, the inverse can be
computed at each iteration. Along with increased computing power, there
are many efficient algorithms for inverting large symmetric positive definite
matrices.

2.4     Comparison  of  Product-Multinomial  and  Product-Poisson  Estimators 

We  begin  this  section  by  introducing  notation  for  a  product-Poisson 
random  vector. 

The $s \times 1$ random vector $Y = \text{vec}(Y_1, \ldots, Y_K)$ is said to be product-Poisson
if

$$Y_{ij} \sim \text{ind Poisson}(e^{\xi_{ij}}), \quad i = 1, \ldots, K, \quad j = 1, \ldots, R. \qquad (2.4.1)$$

Suppose that the $s = RK$ log means $\{\xi_{ij}\}$ satisfy the model $[\Theta_X^{(P)}]$ where

$$\Theta_X^{(P)} = \{\xi \in R^s : C\log(Ae^{\xi}) = X\beta,\ Le^{\xi} = 0\}$$

or equivalently, for appropriately chosen $U$,

$$\Theta_X^{(P)} = \Theta_h^{(P)} = \{\xi \in R^s : U'C\log(Ae^{\xi}) = 0,\ Le^{\xi} = 0\}. \qquad (2.4.2)$$

This model implies all the same constraints on $\xi$ as the product-multinomial
model $[\Theta_h]$ of (2.3.5), with one exception: the identifiability
constraints, $e^{\xi'}(\oplus 1_R) = n'$, are not included.

Denote the maximum likelihood estimators computed assuming (2.4.1)
and (2.4.2) by $\hat\xi^{(P)}$ and $\hat\beta^{(P)}$. Similarly, denote the maximum likelihood
estimators computed assuming (2.3.1) and (2.3.5) by $\hat\xi^{(M)}$ and $\hat\beta^{(M)}$.

Recall that the three product-multinomial model assumptions are

(A1)  The multinomial response model can be specified as in (2.3.3).
That is, the model parameter space can be represented as

$$\Theta_X = \{\xi \in R^s : C_1\log A_1e^{\xi} = X_1\beta_1,\ C_2\log A_2e^{\xi} = X_2\beta_2,\ Le^{\xi} = 0,\ e^{\xi'}(\oplus_1^K 1_R) = n'\},$$

where

$C_i = \oplus_1^K C_{ij}$,        $C_{ij} = C_{i1}$ is $q_i \times m_i$,    $i = 1, 2$
$A_i = \oplus_1^K A_{ij}$,        $A_{ij} = A_{i1}$ is $m_i \times R$,    $i = 1, 2$
$L = \oplus_1^K L_j$,        $L_j = L_1$ is $d \times R$
$\xi = \text{vec}(\xi_1, \ldots, \xi_K)$,        and $\xi_k$ is $R \times 1$
$X_i$      is $Kq_i \times p_i$ of full rank $p_i$,    $i = 1, 2$
$n$   is the $K \times 1$ vector of multinomial indices
$s = RK$,    the total number of cells.

(A2)     Either $C_i = I_{q_iK}$ or $C_i(\oplus 1_{m_i}) = 0$,    $i = 1, 2$,
and

(A3)     If $C_i = I_{q_iK}$ then $\mathcal{M}(X_i) \supset \mathcal{M}(\oplus 1_{m_i})$.

The following theorem states that the maximum likelihood estimators for
$\xi$ and hence $\beta$ are the same under the product-multinomial sampling scheme
of (2.3.1) and the product-Poisson sampling scheme of (2.4.1) provided that
the three assumptions (A1), (A2), and (A3) hold.

Theorem 2.4.1    If the model (2.3.4) satisfies assumptions (A1), (A2), and
(A3), then

$$\hat\xi^{(P)} = \hat\xi^{(M)} \quad \text{and} \quad \hat\beta^{(P)} = \hat\beta^{(M)}.$$

That is, the maximum likelihood estimators of $\beta$ and $\xi$ are the same under
both sampling schemes, product-Poisson (2.4.1) and product-multinomial
(2.3.1).

Proof:    Under the product Poisson assumption of (2.4.1) and (2.4.2), the
kernel of the log likelihood is

$$\ell^{(P)}(\xi; y) = y'\xi - e^{\xi'}1_s.$$

Therefore, letting $\theta = \text{vec}(\xi, \lambda)$, the corresponding Lagrangian objective
function is

$$F^{(P)}(\theta) = \ell^{(P)}(\xi; y) + h'(\xi)\lambda$$

and so to find the maximum (Poisson) likelihood estimator $\hat\theta^{(P)} = \text{vec}(\hat\xi^{(P)}, \hat\lambda^{(P)})$
we must solve the system of equations

$$g(\theta) = \begin{pmatrix} y - e^{\xi} + H(\xi)\lambda \\ h(\xi) \end{pmatrix} = 0. \qquad (2.4.3)$$

The conclusion of the theorem now follows, since the equations (2.3.8) of
Theorem 2.3.1 and (2.4.3) yield exactly the same solutions and

$$\hat\beta^{(P)} = (X'X)^{-1}X'C\log(Ae^{\hat\xi^{(P)}}) = (X'X)^{-1}X'C\log(Ae^{\hat\xi^{(M)}}) = \hat\beta^{(M)}. \qquad \Box$$

As a corollary to Theorem 2.4.1 we have

Corollary 2.4.1  Provided the assumptions of Theorem 2.4.1 hold, the
estimated undetermined multipliers are invariant with respect to sampling
scheme, i.e.

$$\hat\lambda^{(P)} = \hat\lambda^{(M)}.$$

Proof:  The proof follows immediately upon noting that equations (2.3.8)
and (2.4.3) yield exactly the same solutions.  $\Box$

A remark is in order. Basically, Theorem 2.4.1 enables us to conclude
that the sufficient and necessary condition of Birch (1963) holds. This
condition is that the model be specified so that the Poisson ML estimators
necessarily satisfy the identifiability constraints that are required for the
multinomial model.

We now explore the asymptotic behavior of the (Poisson) ML estimator
$\hat\theta^{(P)} = \text{vec}(\hat\xi^{(P)}, \hat\lambda^{(P)})$. For the product-Poisson assumptions (2.4.1) and
(2.4.2), we can obtain the asymptotic distribution of $\hat\theta^{(P)}$ by formally replacing
the $n_* = \min\{n_i\}$ by $\mu_* = \min\{e^{\xi_{ij}}\}$ and using the same arguments as those
used to derive the asymptotic distribution of $\hat\theta^{(M)}$.

Jørgensen (1989) discusses limiting distributions for Poisson random
variables as the mean parameters, or equivalently $\mu_*$, go to infinity. In this
case,

$$g(\theta_0; Y) = \begin{pmatrix} Y - \mu_0 \\ 0 \end{pmatrix}$$

has an asymptotic normal distribution with mean zero and asymptotic
covariance

$$\begin{pmatrix} D(\mu_0) & 0 \\ 0 & 0 \end{pmatrix}.$$

Using arguments similar to those used in the multinomial case, we conclude
that $\hat\theta^{(P)} - \theta_0$ has an
asymptotic normal distribution with mean zero and asymptotic covariance

$$\begin{pmatrix} D(\mu_0) & -H \\ -H' & 0 \end{pmatrix}^{-1}\begin{pmatrix} D(\mu_0) & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} D(\mu_0) & -H \\ -H' & 0 \end{pmatrix}^{-1}.$$

But this can again be simplified as it was in the multinomial case. It can be
shown that the asymptotic covariance can be rewritten as

$$\begin{pmatrix} D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & 0 \\ 0 & (H'D^{-1}H)^{-1} \end{pmatrix}, \qquad (2.4.4)$$

where $D = D(\mu_0) = D(e^{\xi_0})$ and $H = H(\xi_0)$.

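Comparison of the Asymptotic Distributions. Provided assumptions (A1),
(A2), and (A3) hold, both $\hat\theta^{(P)} - \theta_0$ and $\hat\theta^{(M)} - \theta_0$ have asymptotic normal
distributions with zero means and respective covariances given in (2.4.4) and
(2.3.12). Therefore, we have the following interesting results.

Result 1.    The asymptotic covariances of $\hat\theta^{(M)}$ and $\hat\theta^{(P)}$ are related by

$$\text{var}(\hat\theta^{(M)}) = \text{var}(\hat\theta^{(P)}) - \begin{pmatrix} \oplus\dfrac{1_R1_R'}{n_i} & 0 \\ 0 & 0 \end{pmatrix}. \qquad (2.4.5)$$

Result 2.    The asymptotic distributions of $\hat\lambda^{(M)}$ and $\hat\lambda^{(P)}$ are identical and
it follows that the Lagrange multiplier statistic which has form

$$LM = \hat\lambda'(\text{var}(\hat\lambda))^{-1}\hat\lambda = \hat\lambda'(H'D^{-1}H)\hat\lambda$$

is invariant with respect to the sampling scheme.

Result 3.

$$\text{var}(\hat\mu^{(M)}) = \text{var}(\hat\mu^{(P)}) - \oplus\frac{\hat\mu_i^{(P)}\hat\mu_i^{(P)\prime}}{n_i}. \qquad (2.4.6)$$

Result 4.

$$\text{var}(\hat\beta^{(M)}) = \text{var}(\hat\beta^{(P)}) - \Lambda \qquad (2.4.7)$$

where

$$\Lambda = (X'X)^{-1}X'C\begin{pmatrix} \oplus 1_{m_1} \\ \oplus 1_{m_2} \end{pmatrix}D^{-1}(n)\left(\oplus 1_{m_1}',\ \oplus 1_{m_2}'\right)C'X(X'X)^{-1}$$

and is nonnegative definite.

The notation var($\cdot$) used in these results denotes the asymptotic variance.
This is important since the finite sample variances may not even exist.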

The  proofs  for  Results  3  and  4  are  straightforward.  Basically,  they 
involve  using  the  delta  method  and  equation  (2.4.5).  The  interested  reader 
will  find  an  outline  of  the  proofs  in  Appendix  A. 
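
Result 2 can be illustrated numerically. The sketch below uses hypothetical numbers: a $2 \times 2$ table with counts $y = (10, 20, 30, 40)$, fitted means $(12, 18, 28, 42)$ under the single independence constraint with $H = U = (1, -1, -1, 1)'$, and multiplier estimate $\hat\lambda = 2$. It first confirms that these values satisfy the estimating equation $y - \hat\mu + H\hat\lambda = 0$ common to (2.3.8) and (2.4.3), and then evaluates the Lagrange multiplier statistic. Because $y - \hat\mu = -H\hat\lambda$ at the solution, $LM$ coincides algebraically with the Pearson statistic $\sum_j (y_j - \hat\mu_j)^2/\hat\mu_j$.

```python
# Illustrative values (hypothetical table and its independence fit).
y = [10.0, 20.0, 30.0, 40.0]
mu = [12.0, 18.0, 28.0, 42.0]
U = [1.0, -1.0, -1.0, 1.0]      # H = U for the constraint U'xi = 0
lam = 2.0

# Estimating equation y - mu + H*lam = 0, shared by both sampling schemes.
residual = [yj - mj + uj * lam for yj, mj, uj in zip(y, mu, U)]

# Lagrange multiplier statistic lam' (H' D^{-1} H) lam, with D = D(mu).
c = sum(u * u / m for u, m in zip(U, mu))
LM = lam * c * lam

# Pearson statistic for comparison.
pearson = sum((yj - mj) ** 2 / mj for yj, mj in zip(y, mu))
```

Since the same $\hat\lambda$ and $\hat\mu$ solve the Poisson and multinomial equations, the value of `LM` does not depend on which sampling scheme is assumed.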

In practice, it is of particular interest to evaluate the matrix $\Lambda$ of equation
(2.4.7). Often, for convenience, the models are fit assuming the vector $Y$
is product Poisson and then inferences based on the maximum likelihood
estimates are made assuming that they are invariant with respect to the
sampling assumption. Birch (1963) and Palmgren (1981) derive rules for
when these inferences, based on the two different sampling assumptions, will
be equivalent. However, they assume that the model is of a simple loglinear
form. That is, the Poisson model is assumed to have form

$$\Theta_X = \{\xi \in R^s : \xi = X\beta\}.$$

We will use the results of this section to derive more general rules for when
the two inferences will be equal. As a special case of these results, we will
arrive at the Birch and Palmgren results.

The following lemma will enable us to rewrite $\Lambda$ of (2.4.7) in still a
simpler form.

Lemma 2.4.1  Let $Z = [Z_1, \ldots, Z_K]$ be an $r \times K$ matrix of full rank $K$.
Suppose that $X = [X_1, \ldots, X_p]$ is an $r \times p$ ($r \ge p \ge K$) matrix of full rank $p$
such that $\mathcal{M}(X) \supset \mathcal{M}(Z)$, i.e. the range space of $X$ contains the range space
of $Z$. Denote the $T$ ($K \le T \le p$) columns of $X$ that span a space that contains
$\mathcal{M}(Z)$ by $\{X_{\nu_1}, \ldots, X_{\nu_T}\}$. Without loss of generality, suppose that the set of
vectors $\{X_{\nu_1}, \ldots, X_{\nu_T}\}$ is a minimal spanning subset, i.e. the span
of any fewer than $T$ of these vectors does not contain the range space of $Z$. We
conclude that

$$\exists\, W \in R^{T \times K} \ni (X'X)^{-1}X'Z = JW,$$

where the $p \times T$ matrix $J = [e_{\nu_1}, \ldots, e_{\nu_T}]$ and $e_{\nu_i}$ is the $p \times 1$ vector
$(0, \ldots, 0, 1, 0, \ldots, 0)'$ with the '1' in the $\nu_i$th position.

Proof:    Let $X_* = [X_{\nu_1}, \ldots, X_{\nu_T}]$. Now, by assumption, $\mathcal{M}(X_*) \supset \mathcal{M}(Z)$.
Hence, there must exist a matrix $W \in R^{T \times K} \ni Z = X_*W$. Therefore,

$$(X'X)^{-1}X'Z = (X'X)^{-1}X'X_*W = (X'X)^{-1}(X'X_*)W = JW,$$

where $J = (X'X)^{-1}(X'X_*)$ is as stated in the conclusion of the lemma.  $\Box$
Before  stating  the  next  important  theorem,  let  us  write  A  in  another 
way.  Assuming  that  (Al)  holds,  A  can  be  written  as 

A=(;i2;     A^')  (2.4.8) 

where 

A^^-  =  {X',X,)-^XIQ{  e  i^)(  ©  ^JC',Xj{X',X,)-\ 

Now,  if  Cj  is  a  contrast  matrix,  by  assumption  (A2),  we  can  write 

{X'iXi)-'X'iCi{  8  ^)  =  0  =  J(')Vr('),  (2.4.9) 

where  J(')  can  arbitrarily  be  chosen  to  be  equal  to  X[  and  so  W(*)  =  0.  On 
the  other  hand,  if  Ci  =  I^.k  then  we  have  by  (A3)  that  M{Xi)  D  A^(®l„i.). 
Therefore,  we  can  invoke  the  result  of  Lemma  2.4.1  by  setting  Z  =  ®^ 


Since  M{Xi)  D  A^(©lm,)  =  M{Z),  the  conditions  for  the  lemma  are  satisfied. 
Let  X,>  =  [X.  (i),...,X.  (i)]  be  the  rriiK  xTi  {K  <  Ti  <  pi)  submatrix  of  Xi 
that  has  columns  that  form  a  minimal  spanning  subset  for  M{Z)  =  Al(®^p-). 
By  Lemma  2.4.1, 

3Ty«  e  i?^'^-^  3  {X'iXi)-^Xl{  ©  ^)  =  JWK^W.  (2.4.10) 

Here, $J^{(i)} = [e_{\nu_1^{(i)}}, \ldots, e_{\nu_{T_i}^{(i)}}]$, where the $T_i$ elementary vectors correspond to the
columns $\{X_{\nu_1^{(i)}}, \ldots, X_{\nu_{T_i}^{(i)}}\}$ of $X_i$ that form a minimal spanning subset for the
range space of $\oplus 1_{m_i}$, i.e. the $T_i$ columns span a set that contains the range
space of $\oplus 1_{m_i}$ and any smaller set of columns will not span a set containing
the range space of $\oplus 1_{m_i}$.

It  follows  that  the  matrices  A'-'  of  (2.4.8)  can  be  written  as 

$$A^{ij} = J^{(i)} W^{(i)} W^{(j)\prime} J^{(j)\prime} \qquad (2.4.11)$$

where 

J^'^  =  \  ^  "1  "i^^  /•  (2.4.12) 

[  X,' ,  otherwise 

and 

p^W^/tyW,     if^i^V  (2.4.13) 

[  0,  otherwise.  ^  ^ 

We  now  state  a  theorem  of  substantive  importance. 

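Theorem 2.4.2 Suppose that assumptions (A1), (A2), and (A3) hold. For
$r = 1, 2$, if $C_r$ is the identity matrix then let $\{\nu_1^{(r)}, \ldots, \nu_{T_r}^{(r)}\}$ be the set of
indices that index those columns of $X_r$ that form a minimal spanning subset
for $\mathcal{M}(\oplus 1_{m_r})$. Then it follows that the relationship between the asymptotic
variances of the product-multinomial estimator $\hat\beta^{(M)}$ and the product-Poisson
estimator $\hat\beta^{(P)}$ is

$$\text{var}(\hat\beta^{(M)}) = \text{var}(\hat\beta^{(P)}) - \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix},$$

where the $p_i \times p_j$ matrix $A^{ij}$ is a zero matrix whenever at least one of $C_i$ or
$C_j$ is a contrast or zero matrix. Otherwise, if both $C_i$ and $C_j$ are identity
matrices then

$$A^{ij}_{kl} = 0, \quad \text{if } (k, l) \notin \{\nu_1^{(i)}, \ldots, \nu_{T_i}^{(i)}\} \times \{\nu_1^{(j)}, \ldots, \nu_{T_j}^{(j)}\}.$$

Proof: Since (A1), (A2), and (A3) hold, we can rewrite $A^{ij}$ as in (2.4.11).
Now, if either $C_i$ or $C_j$ is a contrast or zero matrix, it is obvious by (2.4.9)
that $A^{ij}$ will have zero components, as stated in the theorem, since at least
one of $W^{(i)}$ or $W^{(j)}$ will be a zero matrix. On the other hand, if both $C_i$ and
$C_j$ are identity matrices, then $A^{ij}$ can be rewritten as in (2.4.11) where

$$J^{(i)} = [e_{\nu_1^{(i)}}, \ldots, e_{\nu_{T_i}^{(i)}}], \qquad J^{(j)} = [e_{\nu_1^{(j)}}, \ldots, e_{\nu_{T_j}^{(j)}}],$$

and the matrices $W^{(i)}$ and $W^{(j)}$ are elements of $R^{T_i \times K}$ and $R^{T_j \times K}$. Hence,

$$A^{ij} = [e_{\nu_1^{(i)}}, \ldots, e_{\nu_{T_i}^{(i)}}]\, W^{(i)} W^{(j)\prime} \begin{pmatrix} e'_{\nu_1^{(j)}} \\ \vdots \\ e'_{\nu_{T_j}^{(j)}} \end{pmatrix} = [e_{\nu_1^{(i)}}, \ldots, e_{\nu_{T_i}^{(i)}}]\, W^{ij} \begin{pmatrix} e'_{\nu_1^{(j)}} \\ \vdots \\ e'_{\nu_{T_j}^{(j)}} \end{pmatrix},$$

where $W^{ij} = W^{(i)} W^{(j)\prime}$ is some $T_i \times T_j$ matrix. Now, since the $\{e_\nu\}$ are elementary
vectors, we have that if

$$(k, l) \notin \{\nu_1^{(i)}, \ldots, \nu_{T_i}^{(i)}\} \times \{\nu_1^{(j)}, \ldots, \nu_{T_j}^{(j)}\}$$

then the component $A^{ij}_{kl} = 0$. Otherwise, if $(k, l)$ is a member of this set, it
must be that $A^{ij}_{kl}$ is one of the elements of the matrix $W^{ij}$. This completes
the proof. ■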

The next two corollaries follow immediately from Theorem 2.4.2.

Corollary 2.4.2 If both $C_1$ and $C_2$ are contrast matrices then

$$\text{var}(\hat\beta^{(M)}) = \text{var}(\hat\beta^{(P)}).$$

Proof: Since both $C_1$ and $C_2$ are contrast matrices it follows that $W^{(1)}$
and $W^{(2)}$ are zero matrices. Therefore, the matrices $A^{ij}$ of the theorem are
zero matrices. ■

Corollary 2.4.3 Let $C_2 = 0$, $X_2 = 0$, and $C_1 = A_1 = I$ so that the model
(2.3.4) becomes

$$\Theta_h = \{\xi \in R^s : \xi = X\beta,\ e^{\xi'}(\oplus_1^K 1_R) = n'\},$$

i.e. a simple loglinear model with $K$ subpopulations. Let $\{\nu_1, \ldots, \nu_T\}$ be the set
of indices that index the columns of $X$ that form a minimal spanning subset
for $\mathcal{M}(\oplus_1^K 1_R)$. Then

$$\text{var}(\hat\beta^{(M)}) = \text{var}(\hat\beta^{(P)}) - A,$$

where the elements of $A$ are such that

$$A_{kl} = 0, \quad \text{if } (k, l) \notin \{\nu_1, \ldots, \nu_T\} \times \{\nu_1, \ldots, \nu_T\}.$$

Proof: The proof is an immediate consequence of the theorem upon
identifying $A^{11}$ of the theorem with $A$ of the corollary. The other matrices
$A^{12}$, $A^{21}$, and $A^{22}$ will be zero since $C_2 = 0$. ■

Corollary 2.4.3 is of practical importance and is essentially the result
shown by Palmgren (1981). In particular, if we parameterize the model
so that there is a parameter included for each of the $K$ independent
multinomials (or $K$ covariate levels), then the $K$ columns of $X$ corresponding
to these $K$ 'fixed by design' parameters will form a basis (and hence a minimal
spanning subset) for $\mathcal{M}(\oplus_1^K 1_R)$. Therefore, if $\beta_i$ and $\beta_j$ are not among the
$K$ parameters fixed by design, then $\text{cov}(\hat\beta_i^{(M)}, \hat\beta_j^{(M)}) = \text{cov}(\hat\beta_i^{(P)}, \hat\beta_j^{(P)})$. We
will illustrate the utility of the above results in the next chapter of this
dissertation.

The  next  section  considers  issues  that  may  arise  when  computing  the 
model  degrees  of  freedom.  It  also  states  some  other  miscellaneous  results 
with  regard  to  the  Lagrange  multiplier  statistic. 

2.5     Miscellaneous  Results 

We  begin  this  section  by  addressing  practical  issues  that  may  arise  during 
nonstandard  model  fitting.  Specifically,  we  will  consider  computing  the  model 
and  distance  (or  residual)  degrees  of  freedom. 

Computing model and distance degrees of freedom. Assuming the model
$[\Theta_h]$ of (2.3.5) is well defined, i.e. the $u + l + K$ constraints are nonredundant,
we can compute the model degrees of freedom as in section 2.2. In that
section, we defined the model degrees of freedom as the number of model
parameters minus the number of independent constraints implied by the
model. Notice that in this application we have an additional $l$ linear
constraints, which were not present in section 2.2. It follows
that the model degrees of freedom for $[\Theta_h]$ is

$$df[\Theta_h] = s - (u + l + K) \qquad (2.5.1)$$

where $s$ is the number of cell means, $u$ is the dimension of the null space of $X'$,
$l$ is the number of linear constraints, and $K$ is the number of identifiability
constraints.

To measure model goodness of fit, we can consider estimating some
hypothetical distance between model $[\Theta_h]$ and the saturated model $[\Theta]$, for
which $u = l = 0$. This distance, denoted $\delta[\Theta_h; \Theta]$, has degrees of freedom

$$df(\delta[\Theta_h; \Theta]) = df[\Theta] - df[\Theta_h] = (s - K) - (s - (u + l + K)) = u + l. \qquad (2.5.2)$$

Notice that, had we considered the product Poisson model (2.4.2), which carries
no identifiability constraints, the distance degrees of freedom would be

$$s - (s - (u + l)) = u + l,$$

which is identical to the product multinomial distance degrees of freedom of
(2.5.2).

We have assumed that the $u + l + K$ constraints are nonredundant, i.e.
each constraint is not implied by the other constraints. This may not always
be the case. To illustrate, consider the model specification for example 3 of
section 2.2.2. The model $[\Theta_{MH}]$ implies that the two marginal distributions
are equal. We stated at the end of that example that the additional constraint
$\pi_{2+} - \pi_{+2} = 0$ was redundant. This can be seen since

$$\pi_{2+} - \pi_{+2} = \pi_{21} - \pi_{12} = -(\pi_{1+} - \pi_{+1}) = 0.$$

That is, the constraints of model $[\Theta_{MH}]$ imply that $\pi_{2+} - \pi_{+2}$ equals zero.
Had we blindly added this constraint, we may have incorrectly calculated
the model degrees of freedom as 1 and the distance degrees of freedom as 2.
Therefore, we must be very careful to have a set of nonredundant constraints
when computing degrees of freedom.
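For linear constraints such as these, redundancy can be detected by a rank computation. The sketch below is illustrative only and is not the author's FORTRAN implementation: it writes the two marginal homogeneity constraints of the 2 x 2 table as rows acting on $(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22})$, counts the independent ones by matrix rank, and applies (2.5.1). The helper name `distance_df` is hypothetical.

```python
import numpy as np

# Cell probabilities ordered as (pi_11, pi_12, pi_21, pi_22).
c1 = np.array([0, 1, -1, 0])   # pi_1+ - pi_+1 = pi_12 - pi_21
c2 = np.array([0, -1, 1, 0])   # pi_2+ - pi_+2 = pi_21 - pi_12

def distance_df(constraints):
    """Number of linearly independent constraints (rank of the stacked rows)."""
    return int(np.linalg.matrix_rank(np.vstack(constraints)))

s, K = 4, 1                                  # four cells, one multinomial
naive_count = 2                              # both constraints counted blindly
independent_count = distance_df([c1, c2])    # c2 = -c1, so only one is independent

model_df = s - (independent_count + K)       # (2.5.1) with u + l = independent_count
naive_model_df = s - (naive_count + K)
print(independent_count, model_df, naive_model_df)
```

The rank-based count gives distance degrees of freedom 1 and model degrees of freedom 2, whereas the blind count reproduces the incorrect values 2 and 1 described above. For nonlinear constraints the same idea would apply to the Jacobian of the constraint function, which is an assumption beyond what this sketch demonstrates.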

In practice, when models are more complicated, it may be difficult to
ascertain whether or not the model constraints are nonredundant. Fortunately,
there are two very useful results that help in this regard.

The first result is that when the constraints are redundant, the matrix
$(H'D^{-1}H)$ evaluated at some point in $\Theta_h$ is of less than full rank and is not
invertible. Therefore, in practice, if the algorithm (2.3.13) does not converge
due to G being singular, it may be due to redundant constraints, i.e. an
ill-defined model. The user should investigate and possibly respecify the model
should this occur. A caveat is that due to computational roundoff error, a
singularity may not occur even when the model is ill defined because the
iterate estimates, including the final estimate, may not strictly lie in $\Theta_h$. The
next result may mitigate this problem.

A result that is useful in practice is that a necessary condition for the
constraints to be nonredundant, or equivalently for the model to be well
defined, is that the Lagrange multiplier statistic be invariant to the choice of
$U$, a matrix with columns spanning the null space of $X'$. Evidently, if the
user fits the model several times, each time using a different '$U$' matrix, and
the Lagrange multiplier statistic varies (more so than can be explained by
roundoff error), then it must be that the model is ill defined.

Formally, this necessary condition can be stated as follows.

Theorem 2.5.1 Let $U_1$ and $U_2$ ($U_1 \ne U_2$) be any two full column rank
matrices satisfying $U_i'X = 0$, $i = 1, 2$. Denote the Lagrange multiplier statistic
evaluated using $U_i$ by $LM(U_i)$. If the matrix


^,  =  MQ^|[log(e^'A')C'C/„    e^'L'] 
is such that [Hi, e^] is of full column rank, $i = 1, 2$, and hence the models are well
defined, then

$$LM(U_1) = LM(U_2),$$

i.e. the value of the Lagrange multiplier statistic is invariant with respect to the
choice of $U$.

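Proof: Denote the model specified in terms of $U_i$ by $[\Theta_{h_i}]$, $i = 1, 2$. By
the definition of $U_i$ we know that the constraints implied by $[\Theta_{h_1}]$ and $[\Theta_{h_2}]$
are equivalent. Hence, the solution $\hat\xi$ to (2.3.8), or equivalently (2.4.3), under
either model is the same. Thus, in view of the first set of equations in (2.3.8),
any solution $\text{vec}(\hat\xi, \hat\lambda_i)$ under model $[\Theta_{h_i}]$ must satisfy

$$-(y - e^{\hat\xi}) = H_i(\hat\xi)\hat\lambda_i, \quad i = 1, 2. \qquad (2.5.3)$$

Notice that since $U_1 \ne U_2$, we have that $H_1(\hat\xi) \ne H_2(\hat\xi)$ and by (2.5.3) $\hat\lambda_1 \ne \hat\lambda_2$.
Now, (2.5.3) implies that

$$H_1(\hat\xi)\hat\lambda_1 = H_2(\hat\xi)\hat\lambda_2. \qquad (2.5.4)$$

Also, since $H_i(\hat\xi)$ is assumed to be of full column rank, the variance of $\hat\lambda_i$,

$$\text{var}(\hat\lambda_i) = \left(H_i'(\hat\xi)D^{-1}(e^{\hat\xi})H_i(\hat\xi)\right)^{-1} \qquad (2.5.5)$$

exists. Therefore, the Lagrange multiplier statistics $LM(U_i)$, which have form

$$\hat\lambda_i'[\text{var}(\hat\lambda_i)]^{-1}\hat\lambda_i, \quad i = 1, 2 \qquad (2.5.6)$$

exist. Substituting (2.5.5) into (2.5.6) gives $(H_i(\hat\xi)\hat\lambda_i)'D^{-1}(e^{\hat\xi})(H_i(\hat\xi)\hat\lambda_i)$, which
depends on $i$ only through $H_i(\hat\xi)\hat\lambda_i$. Finally, by (2.5.4)-(2.5.6), it follows that

$$LM(U_1) = \hat\lambda_1'[\text{var}(\hat\lambda_1)]^{-1}\hat\lambda_1 = \hat\lambda_2'[\text{var}(\hat\lambda_2)]^{-1}\hat\lambda_2 = LM(U_2).$$

This completes the proof. ■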

The final result of this section states that the Lagrange multiplier
statistic is exactly the same as the Pearson chi-squared statistic whenever the
random vector $Y$ is product-Poisson or product-multinomial and the model
satisfies assumptions (A1), (A2), and (A3).

Theorem 2.5.2 Assume that the product-multinomial model satisfies
assumptions (A1), (A2), and (A3). Let $X^2$ denote the Pearson chi-squared
statistic, i.e.

$$X^2 = (y - \hat\mu)'D^{-1}(\hat\mu)(y - \hat\mu),$$

where $\hat\mu$ is the ML estimator under either of the sampling schemes,
product-multinomial or product-Poisson. It follows that the Lagrange multiplier
statistic $LM$ is equivalent to $X^2$. That is,

$$LM = X^2.$$

Proof: By equations (2.5.3), (2.5.5), and (2.5.6) of the previous theorem's
proof and the fact that $e^{\hat\xi} = \hat\mu$, we have that

$$LM = (y - \hat\mu)'D^{-1}(\hat\mu)(y - \hat\mu) = X^2.$$

This is what we set out to show. ■

2.6     Discussion

In this chapter, we discussed in some detail issues related to parametric
modeling. In particular, we followed the lead of Aitchison and Silvey (1958,
1960) and Silvey (1959) and described two ways of specifying models: using
constraint equations and using freedom equations. In section 2.2, distance
measures for quantifying how far apart two models are, relative to how close
they are to holding, were discussed. In particular, the power-divergence
measures (Read and Cressie, 1988) were used when the parameter spaces were
subsets of an $(s - 1)$-dimensional simplex. Estimates of these distances were
developed based on very intuitive notions. Also, a geometric interpretation
of model and residual (or distance) degrees of freedom was given.

In  section  2.3,  we  described  a  general  class  of  multivariate  polytomous 
(categorical)  response  data  models.  The  class  of  models,  which  satisfy 
assumptions  (Al),  (A2),  and  (A3),  were  shown  to  satisfy  the  necessary  and 
sufficient  conditions  of  Birch  (1963)  so  that  the  models  could  be  fitted  using 
either  the  product-Poisson  or  product-multinomial  sampling  assumption. 

An ML fitting method was developed, using results of Aitchison and
Silvey (1958, 1960) and Haber (1985a, 1985b). The algorithm used Lagrangian
undetermined multipliers in conjunction with a modified Newton-Raphson
iterative scheme. The modification, which simplifies the method of Haber
(1985a), is to use a simpler matrix than the Hessian matrix. We replace
the Hessian matrix (of the Lagrangian objective function) by its dominant
part, which turns out to be easily inverted. Because the matrices used in the
algorithm proposed in this chapter are very large and must be inverted, this
modification is a very important one. A FORTRAN program 'mle.restraint'
has been written by the author to implement this modified algorithm.

The asymptotic behavior of the ML estimators computed under the two
sampling schemes (product-Poisson and product-multinomial) was
investigated. The method for deriving the asymptotic distributions represents a
modification to the technique of Aitchison and Silvey (1958). A comparison of
the limiting distributions of the two estimators was made in section 2.4. Some
very interesting results were obtained by studying the asymptotic behavior
in the constraint equation setting. In particular, Theorem 2.4.2 represents
a generalization of the results of Palmgren (1981). The theorem provides a
method for determining when the inferences about the freedom parameters
of a generalized loglinear model of the form $C \log A\mu = X\beta$ will be invariant
with respect to the sampling assumption. Palmgren (1981) developed some
similar results for the special case when the freedom parameters are part of
a loglinear model.

It  is  important  to  note  that  the  asymptotic  results  are  only  valid  if 
the  number  of  populations  K  is  considered  fixed  and  the  expected  counts 
all  get  large  at  approximately  the  same  rate.  In  particular,  the  asymptotic 
arguments  do  not  hold  when  the  covariates  are  continuous,  since  the  number 
of  populations  (levels  of  the  covariates)  can  theoretically  run  off  to  infinity. 
The  reason  the  arguments  do  not  hold  is  that  when  we  use  the  method  of 
Aitchison  and  Silvey  (1958)  it  is  required  that  the  vector  nZ^  ^M|;y)  converge 
in  probability  to  zero  as  the  total  number  of  observations  gets  large.  This  is 
the case only when the smallest sample size, $\min\{n_1, \ldots, n_K\}$, goes to infinity.
This drawback could prove to be temporary. It seems reasonable to assume in
many cases that, as long as the 'information' about each parameter is increasing
without bound, the estimators will be consistent and asymptotically normally
distributed. For example, consider the logistic regression model with continuous
covariates. Although the $n_k$'s may all be 1, the ML estimators of the
regression parameters are often consistent and asymptotically normal.

Section 2.5 outlines some miscellaneous results. One result that is
important to the practicing statistician is that the Lagrange multiplier
statistic is shown to be invariant with respect to the choice of the matrix $U$
(of $U'C \log A\mu = 0$) as long as the model is well defined. An important
implication of this result is that if one fits the model several times, each
time using a different '$U$' matrix, and the Lagrange multiplier statistics
vary more than can be explained by roundoff, then it could be that the
model is not well defined. Another interesting result is that the Lagrange
multiplier statistic is simply the Pearson chi-squared statistic $X^2$ whenever
the assumptions (A1), (A2), and (A3) are satisfied.

Theoretically the ML fitting algorithm will work for any size problem.
Practically, however, the algorithm is certainly not a model fitting panacea.
The number of parameters that must be estimated gets very large, very fast.
Consider the case where 7 raters rate the same set of objects on a 5 point
scale. Even without covariates, the number of cell probabilities that must be
estimated is $5^7 = 78{,}125$. It seems the ML fitting method developed in this
chapter is, at least for now, useful for moderate size problems only. It can be
used to analyze longitudinal categorical response data when the number
of measurements taken on each subject is somewhere in the neighborhood of
2 to 6. This is not to take away from the utility of this chapter's algorithm,
but rather to indicate its breadth of application. In time, with increasing
computer efficiency, much larger data sets may be fitted using this algorithm.
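The growth in problem size is easy to tabulate. The sketch below is illustrative: the function name `n_cells` is hypothetical, and the memory figure assumes a single dense $s \times s$ matrix stored in 8-byte floating point, which is an assumption about storage rather than a description of the author's program.

```python
def n_cells(d, T, K=1):
    """Number of cell probabilities for T responses on a d-point scale in K populations."""
    return K * d ** T

def dense_matrix_bytes(s, bytes_per_entry=8):
    """Storage needed for one dense s x s matrix (assumed double precision)."""
    return s * s * bytes_per_entry

for T in range(2, 8):
    s = n_cells(5, T)
    print(T, s, dense_matrix_bytes(s))
```

For $T = 7$ and $d = 5$ this reproduces the 78,125 cells quoted above, and a single dense matrix of that order would already need roughly 49 gigabytes under the stated storage assumption.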


CHAPTER  3 

SIMULTANEOUSLY MODELING THE JOINT AND MARGINAL
DISTRIBUTIONS OF MULTIVARIATE POLYTOMOUS
RESPONSE VECTORS


3.1     Introduction 

Often, when given an opportunity to analyze multivariate response
data,  the  investigator  may  wish  to  describe  both  the  joint  and  marginal 
distributions  simultaneously.  We  consider  a  broad  class  of  models  which 
imply  structure  on  both  the  joint  and  marginal  distributions  of  multivariate 
polytomous  response  vectors.  To  illustrate  the  need  for  such  models,  we 
consider  several  settings  where  these  models  would  be  useful.  For  example, 
when  the  multivariate  responses  represent  repeated  measures  of  the  same 
categorical  response  across  time,  one  may  be  interested  in  how  the  marginal 
distributions  are  changing  across  time  and  how  strongly  the  responses  are 
associated.  The  simultaneous  investigation  of  both  joint  and  marginal 
distributions  is  not  restricted  to  the  longitudinal  data  setting.  Other  examples 
include  the  analysis  of  rater  agreement,  cross-over,  and  social  mobility  data. 
The  common  thread  tying  all  of  these  data  types  together  is  that  the  sampling 
scheme  is  such  that  the  different  responses  are  correlated.  In  longitudinal 
studies  the  same  subject  responds  on  several  occasions.  In  rater  agreement 
studies, raters rate the same objects. In two-period cross-over studies, one
group of subjects receives the two treatments in one order and the other group
receives them in the other order. In social mobility studies, the socio-economic
status of a father-son pair is recorded. When the responses are positively
correlated, these designs result in increased power for detecting differences
between the marginal distributions (Laird, 1991; Zeger, 1988).

This  chapter  considers  the  modeling  of  multivariate  categorical  responses 
in  which  the  same  response  scale  is  used  for  each  response.  The  classes 
of  models  used  in  this  chapter  are  of  the  form  considered  in  Chapter  2  of 
this  dissertation  and  hence  are  readily  fit  using  the  ML  methods  of  that 
chapter.  In  section  3.2,  we  give  several  examples  that  may  be  analyzed  by 
simultaneously  modeling  the  joint  and  marginal  distributions.  We  introduce 
the  classes  of  simultaneous  Joint-Marginal  models  in  section  3.3.  Several 
models  are  fitted  to  the  data  sets  of  section  3.2. 

3.2     Product-Multinomial  Sampling  Model 

Initially, we assume that a random sample of $n_k$ subjects is taken from
population  k,  k  =  1,...,K.  The  number  of  populations,  or  covariate  profiles, 
K  is  considered  to  be  some  fixed  integer.  The  subscript  k  is  allowed  to  be 
compound,  i.e.  the  subscript  k  is  allowed  to  represent  a  vector  of  subscripts 
such  as 

K  ^=  \ki,  K25  ■  •  •  5  kif). 

Suppose that there are $T$ categorical responses $V^{(1)}, \ldots, V^{(T)}$ of interest
and that each response is measured on the same response scale. Let
$V_k = (V_k^{(1)}, \ldots, V_k^{(T)})'$ be the random vector of responses for population $k$
and $V_{ku}$, $u = 1, \ldots, n_k$, be the $n_k$ independent and identically distributed
copies of $V_k$, where $V_{ku}$ denotes the response profile for the $u$th randomly
chosen person within population $k$. Notationally we have,

$$V_{ku} \sim \text{i.i.d. } V_k, \quad u = 1, \ldots, n_k.$$

For our purposes we can assume that each response takes on values in
$\{1, 2, \ldots, d\}$ with probability one. Denote the probability that a randomly
selected subject from population $k$ has response profile $i = (i_1, \ldots, i_T)'$ by $\pi_{ik}$,
i.e.

$$\pi_{ik} = P(V_k^{(1)} = i_1, \ldots, V_k^{(T)} = i_T),$$

where $i \in \{1, \ldots, d\} \times \cdots \times \{1, \ldots, d\}$.

The joint distribution of $V_k = (V_k^{(1)}, \ldots, V_k^{(T)})'$ is specified as $\{\pi_{ik}\}$. The
marginal distributions of $V_k$ will be denoted by $\{\phi_i(t; k)\}$, $t = 1, \ldots, T$, where

$$\phi_i(t; k) = P(V_k^{(t)} = i), \quad i = 1, \ldots, d.$$

Our objective is to model simultaneously the $K$ joint distributions

$$\{\pi_{ik}\}, \quad k = 1, \ldots, K$$

and the $KT$ marginal distributions

$$\{\phi_i(t; k)\}, \quad t = 1, \ldots, T, \quad k = 1, \ldots, K.$$

To help the reader better understand the notation, we consider the one
population bivariate case. When $T = 2$, the response profiles can be denoted
by $i = (i_1, i_2) = (i, j)$, where $i = 1, \ldots, d$ and $j = 1, \ldots, d$. Since there is
just one population (or covariate profile) the subscript $k$ is always 1 and is
therefore dropped. It follows that $\{\pi_{ij}\}$ is the joint distribution of $(V^{(1)}, V^{(2)})'$
and $\{\phi_i(t)\}$, $t = 1, 2$, are the two marginal distributions. That is,

$$\pi_{ij} = P(V^{(1)} = i, V^{(2)} = j), \quad i = 1, \ldots, d, \quad j = 1, \ldots, d$$

and

$$\phi_i(t) = \begin{cases} \pi_{i+} = P(V^{(1)} = i), & \text{if } t = 1 \\ \pi_{+i} = P(V^{(2)} = i), & \text{if } t = 2 \end{cases}$$

for $i = 1, 2, \ldots, d$.

Now for each population $k$, consider the $d^T \times 1$ random vector of
indicators

$$\Psi_k = [I(V_k = i_1), \ldots, I(V_k = i_{d^T})]'.$$

Notice that no information about the $V_k$ is lost since $\Psi_k$ is a one-to-one
function of $V_k$. Also,

$$\Psi_k \sim \text{ind. Mult}(1, \{\pi_{ik}\}), \quad k = 1, \ldots, K.$$

Therefore, since we have randomly sampled $n_k$ subjects from each of the $K$
populations, we have that for given $k$

$$\Psi_{k1}, \Psi_{k2}, \ldots, \Psi_{kn_k} \sim \text{i.i.d. Mult}(1, \{\pi_{ik}\})$$

and hence the vector

$$Y_k = \sum_{u=1}^{n_k} \Psi_{ku} \sim \text{Mult}(n_k, \{\pi_{ik}\})$$

is sufficient for the family of distributions $\{\pi_{ik}\}$ and $\{\phi_i(t; k)\}$.

By independence across populations, the vector $\text{vec}(Y_1, Y_2, \ldots, Y_K)$ is
sufficient for the joint and marginal distributions of $\text{vec}(V_1, V_2, \ldots, V_K)$.
Further, the random vector $\text{vec}(Y_1, Y_2, \ldots, Y_K)$ is product-multinomial, i.e.

$$Y_k = (Y_{1k}, \ldots, Y_{Rk})' \sim \text{ind. Mult}(n_k, \{\pi_{ik}\}), \quad k = 1, \ldots, K$$

where $1, \ldots, R$ represent the $R = d^T$ different response profiles.

Evidently, $Y_{ik}$ represents the number of randomly selected subjects from
population $k$ who have response profile $i$. That is, the $\{Y_{ik}\}$ represent counts
resulting from a cross-classification of $N = \sum_{k=1}^{K} n_k$ subjects on $T$ response
variables and a population variable. The data can be displayed in a $d^T \times K$
contingency table. By convention, we use lower case Roman letters to denote
realizations of random quantities. For example, $y_{ik}$ represents a particular
realization of $Y_{ik}$.
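The passage from individual response profiles to the count vector $Y_k$ can be sketched in a few lines. The example is illustrative only: the sample of five subjects is made up, the function name `profile_counts` is hypothetical, and the lexicographic ordering of the $d^T$ profiles is an assumed convention.

```python
from collections import Counter
from itertools import product

def profile_counts(responses, d, T):
    """Count how many subjects show each of the d**T response profiles.

    `responses` is a list of length-T tuples with entries in 1..d.
    Returns the profiles in lexicographic order and the matching counts.
    """
    tally = Counter(tuple(v) for v in responses)
    profiles = list(product(range(1, d + 1), repeat=T))
    return profiles, [tally.get(i, 0) for i in profiles]

# Hypothetical sample of n_k = 5 subjects, T = 2 responses, d = 2 levels.
sample = [(1, 1), (1, 2), (1, 1), (2, 2), (1, 1)]
profiles, y_k = profile_counts(sample, d=2, T=2)
print(profiles)
print(y_k)
```

Summing indicator vectors over subjects and tallying profiles are the same operation, so `y_k` is a realization of $Y_k$ for this made-up sample, and its entries add up to $n_k$.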

Consider  Table  3.1,  taken  from  Hout  et  al.  (1987). 

Table 3.1. Interest in Political Campaigns

                                  1960
                   Not Much   Somewhat   Very Much   Total
1956  Not Much        155        116          64       335
      Somewhat         91        237         171       499
      Very Much        32         91         246       369
      Total           278        444         481      1203

Source: Hout et al. (1987), p. 166, Table 4


Each of 1203 randomly selected subjects was asked in 1956 how
interested they were in the political campaigns. They responded on the 3-category
ordinal scale: 1 = Not Much, 2 = Somewhat, and 3 = Very Much.

Then, in 1960, each of the subjects was asked the same question and
responded on the same 3-category ordinal scale. Using the above notation,
we let $V^{(1)}$ and $V^{(2)}$ represent the responses in 1956 and 1960. Let $y_{ij}$, $i, j =
1, 2, 3$, represent the number of the $N = 1203$ subjects responding at level
$i$ in 1956 and level $j$ in 1960. Since there is just one population
of interest, we drop the population subscript altogether. Finally, for this
bivariate response example, the compound subscript $i$ is replaced by $ij$. Table
3.1 summarizes the bivariate responses.
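To connect the notation to the data, the sample versions of $\{\pi_{ij}\}$, $\{\phi_i(1)\}$, and $\{\phi_i(2)\}$ can be computed directly from the counts. This is a descriptive sketch only, with rows taken to be the 1956 response and columns the 1960 response; the variable names are illustrative, and the proportions are raw sample proportions rather than fitted values from any model.

```python
import numpy as np

# Table 3.1 counts: rows = 1956 response, columns = 1960 response.
y = np.array([[155, 116,  64],
              [ 91, 237, 171],
              [ 32,  91, 246]])

N = y.sum()
p_joint = y / N                 # sample version of {pi_ij}
p_1956 = p_joint.sum(axis=1)    # sample version of {phi_i(1)} = {pi_i+}
p_1960 = p_joint.sum(axis=0)    # sample version of {phi_i(2)} = {pi_+i}

print(N)
print(np.round(p_1956, 3))
print(np.round(p_1960, 3))
```

The row and column sums of `y` reproduce the margins of Table 3.1, which is a quick check that the counts have been entered in the right orientation.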

As another example, consider the cross-over data of Ezzet and
Whitehead (1991).

Table 3.2. Cross-over Data

AB Sequence (Group 1)
                 B
           1    2    3    4
      1   59   35    3    2
A     2   11   27    2    1
      3    0    0    0    0
      4    1    1    0    0

BA Sequence (Group 2)
                 B
           1    2    3    4
      1   63   40    7    2
A     2   13   15    2    0
      3    0    0    1    1
      4    0    0    0    0

The counts displayed in Table 3.2 are from a study conducted by 3M
Health Care Ltd. to compare the suitability of two inhalation devices (A and
B) in patients who are currently using a standard inhaler device delivering
salbutamol. Two independent groups of subjects participated. Group 1 used
device A for a week followed by device B (sequence AB). Group 2 used the
devices in reverse order (sequence BA).

The response variables $V^{(1)}$ (device A) and $V^{(2)}$ (device B) are ordinal
polytomous. Specifically, they are the self-assessment on clarity of leaflet
instructions accompanying the two devices, recorded on the ordinal four point
scale,

1 = Easy
2 = Only clear after rereading
3 = Not very clear
4 = Confusing.

For this example there are two populations of interest: Group 1 and
Group 2. Let $y_{ijk}$ represent the number of the $n_k$ subjects responding at level
$i$ for device A and level $j$ for device B, where $n_1 = 142$ and $n_2 = 144$. Again,
the bivariate response profiles $i$ can be denoted by $ij$, where $i, j = 1, 2, 3, 4$.
The bivariate responses are summarized in Table 3.2.

3.3     Joint  and  Marginal  Models 

Two  types  of  questions  that  can  be  posed  about  Table  3.1  lead  to  quite 
distinct  types  of  models.  One  question  is  whether  the  interest  in  the  political 
campaigns  was  different  at  the  two  times.  For  example,  the  researcher 
may  wish  to  test  the  hypothesis  that  there  was  more  interest  in  the  1960 
political  campaign  than  the  1956  political  campaign.  An  investigation  into  the 
marginal  distributions  is  needed  to  test  this  hypothesis.  For  these  bivariate 
response  data,  the  marginal  distributions  correspond  to  the  row  and  column 
distributions  of  Table  3.1.  A  second  question  that  may  be  asked  is  whether 
the  two  responses  are  associated  and  if  so,  how  strong  is  the  association.  To 
answer  these  questions,  we  must  describe  the  dependence  displayed  in  the 
joint  distribution  of  Table  3.1. 

The  marginal  models  we  consider  will  be  used  to  investigate  whether 
the  probability  that  a  randomly  selected  subject  responds  at  level  i  or  lower 
in  1956  is  different  from  the  probability  that  a  randomly  selected  subject 
responds at level i or lower in 1960. In this sense, the comparison of marginal
distributions gives a 'population averaged' description of change. That is, we
will  describe  how  the  marginal  distribution  changes  on  the  whole,  averaging 
over  the  entire  population.  In  contrast,  subject-specific  modeling  allows  us  to 
investigate  how  a  randomly  chosen  subject's  response  changes  from  1956  to 
1960.  Zeger  et  al.  (1988)  discuss  at  length  the  difference  between  population- 
average  and  subject-specific  models. 
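The population-averaged comparison described here can be previewed with sample cumulative proportions from Table 3.1. The sketch below is descriptive only and does not fit any of the marginal models; it takes rows as the 1956 response and columns as the 1960 response, and the variable names are illustrative.

```python
import numpy as np

# Table 3.1 counts: rows = 1956 response, columns = 1960 response.
y = np.array([[155, 116,  64],
              [ 91, 237, 171],
              [ 32,  91, 246]])

N = y.sum()
cum_1956 = np.cumsum(y.sum(axis=1)) / N   # sample P(V1 <= i), i = 1, 2, 3
cum_1960 = np.cumsum(y.sum(axis=0)) / N   # sample P(V2 <= i), i = 1, 2, 3

# Positive differences mean responses sat lower on the scale in 1956 than in 1960.
diff = cum_1956[:-1] - cum_1960[:-1]
print(np.round(cum_1956, 3), np.round(cum_1960, 3), np.round(diff, 3))
```

Both differences are positive in the sample, which is the direction one would expect if interest was higher in 1960. Whether the difference is more than sampling variation is a question for the marginal models, not for this descriptive summary.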

The  same  types  of  questions  may  be  posed  about  the  distributions  of 
Table  3.2.  For  example,  one  may  wish  to  determine  whether  the  leaflet 
instructions  are  perceived  as  clearer  for  one  of  the  devices.  Also,  we  may 
be interested in whether there is a sequence effect. That is, does the order
of 'exposure' to the two devices' instruction leaflets affect the perception of
clarity? To answer these two questions we must investigate the marginal
distributions  corresponding  to  the  row  and  column  totals  of  Table  3.2.  Finally, 
one  may  be  interested  in  testing  whether  the  association  between  the  two 
responses  is  the  same  for  both  sequences.  We  will  consider  modeling  the  joint 
distributions  to  answer  this  question. 

Modeling  of  marginal  distributions  is  usually  conducted  separately 
from  the  modeling  of  joint  distributions.  We  use  results  from  Chapter  2 
of  this  dissertation  to  show  that  these  models  can  be  fit  simultaneously 
using  maximum  likelihood  methods.  Simultaneously  modeling  the  joint  and 
marginal  distributions  leads  to  several  advantages.  It  will  provide  a  single 
test  for  overall  goodness  of  fit.  Also,  it  provides  improved  model  parsimony, 
potentially  resulting  in  better  estimates  than  one  would  obtain  by  fitting  the 
models separately.

We consider four classes of simultaneous models. Let $J(S)$ represent
the class of saturated joint distribution models. These models imply no
structure on the joint distributions and therefore allow for general association
between the $T$ responses. Similarly, let $M(S)$ be the class of marginal
models that assume no structure on the marginal distributions, i.e. $M(S)$
is the class of saturated marginal models. Denote the classes of unsaturated
models by $J(U)$ and $M(U)$. By simultaneously modeling the joint and
marginal distributions we can consider four classes of models, $J(S) \cap M(S)$,
$J(U) \cap M(S)$, $J(S) \cap M(U)$, and $J(U) \cap M(U)$. The union of these four
classes will be denoted by $J \cap M$. We let the symbol $J(M_1) \cap M(M_2)$, where
$M_1$ and $M_2$ are particular models, represent a specific model in $J \cap M$. Some
examples of $M_1$ and $M_2$ are $M_1 = QSY$, the quasi-symmetry model, and
$M_2 = MH$, the marginal homogeneity model. The two symbols $S$ and $U$
will represent either the 'class' of saturated and unsaturated models or an
arbitrary model in those classes. It is always possible that the joint distribution
structure implied by the joint model $J(M_1)$ constrains the marginal
distributions in some way. In this case the
model may not be well defined in the sense of Chapter 2. We address this
issue in section 3.6.

The first class of models, J(S) ∩ M(S), is the class of completely
unstructured  or  fully  saturated  models.  These  models  fit  the  data  perfectly 
and  are  used  primarily  for  exploratory  purposes.  If  an  estimated  freedom 
parameter  is  small  relative  to  its  standard  error,  the  corresponding  effect 
may  prove  to  be  negligible.  In  this  way,  the  fit  of  the  saturated  model  may 
suggest simpler models that may fit the data well.

The models in class J(U) ∩ M(S) focus on modeling the joint
distributions. No additional structure on the marginal distribution is assumed.
This class includes ordinary loglinear models for the expected cell frequencies
in the joint distributions. Fitting this simultaneous model is equivalent to
separately fitting the joint distribution model J(U) in that the goodness-of-fit
statistic and joint model parameter estimates will be exactly the same. There
is, however, some benefit to fitting the simultaneous model: marginal model
parameter estimates are obtained. In general, these J(U) models are not
designed to estimate effects in marginal distributions. There are exceptions.
For example, the symmetry model for the joint distribution implies that all of
the marginal distributions are equal. Bishop et al. (1975) discuss comparing
the fit of the symmetry (SY) model to the fit of the quasi-symmetry (QSY)
model to test for marginal homogeneity. Our focus will be on models that
do not imply any structure on the marginal distribution. Loglinear models
that assume no relationship among the main effect parameters satisfy this
condition.

The models in class J(S) ∩ M(U) are used to answer questions about the
marginal distributions. They assume no structure for the joint distribution
and hence allow for general association among the responses. Fitting a
J(S) ∩ M(U) model is equivalent to separately fitting the M(U) model in
that the goodness-of-fit statistic and the marginal model parameter estimates
are exactly the same. A simple M(U) model that is often of interest is
the marginal homogeneity (MH) model. Madansky's (1963) test of marginal
homogeneity is simply the likelihood-ratio test comparing the fit of
J(S) ∩ M(MH) to the saturated model J(S) ∩ M(S). For bivariate dichotomous
response data, an analogous test using the Lagrange multiplier statistic
(which is shown to be equal to Pearson's chi-squared statistic in Chapter
2) is McNemar's (1947) test.

In this chapter, we will focus primarily on the parsimonious models
within the class J(U) ∩ M(U). Often, a simple model can be found
that fits the data relatively well. Simultaneous inferences about both the
association structure and the marginal distribution structure can be made
using the model or freedom parameter estimates, or goodness-of-fit statistics.
Also, by the parsimony principle, the parameter estimates may be more
reliable than those based on less structured models. See Agresti (1990) and
Bishop et al. (1975) for a discussion of the benefits of using parsimonious
models. We can use models within this class to test hypotheses such as MH
given that QSY holds. This can be accomplished by comparing the fit of
J(QSY) ∩ M(MH) to the fit of J(QSY) ∩ M(S). More generally, we may
wish to test for MH given that some simple model $M_1$ holds for the joint
distribution.

Let $\mu_k = (\mu_{1k}, \ldots, \mu_{Rk})'$ be the vector of expected frequencies for
population k. That is

The $RK \times 1$ vector $\mu$ is defined as $\mu = \mathrm{vec}(\mu_1, \mu_2, \ldots, \mu_K)$. For the marginal
distributions, let $\{m_i(t; k) = n_k \phi_i(t; k)\}$ represent the marginal distribution
expected cell frequencies. Cumulative marginal probabilities will be denoted
by $\eta_i(t; k)$, i.e.,

\[
\eta_i(t; k) = \sum_{j=1}^{i} \phi_j(t; k), \qquad i = 1, \ldots, d.
\]

We consider models in the following classes:

\[
\begin{aligned}
J &: \quad C_1 \log A_1 \mu = X_1 \beta_1 \quad \text{or} \quad L_1 \mu = X_1 \beta_1 \\
M &: \quad C_2 \log A_2 \mu = X_2 \beta_2 \quad \text{or} \quad L_2 \mu = X_2 \beta_2
\end{aligned}
\tag{3.3.1}
\]

The matrices $C_1$ and $C_2$ are either identity, contrast (rows sum to zero),
or zero matrices. The model matrices $X_1$ and $X_2$ are assumed to be of full
column rank. We refer to the parameters in vectors $\beta_1$ and $\beta_2$ as freedom
parameters, whereas the components of the parameter vector $\mu$ will be called
model parameters.

Evidently, the class of models J ∩ M of (3.3.1) is very broad. Permissible
models for the joint distributions include simple loglinear models as well as
models for log odds ratios using individual cells (e.g., local odds ratios)
or groupings of cells (e.g., global odds ratios, which are cross-product
ratios of quadrant probabilities; cf. Dale, 1986). The marginal models of
class M can be loglinear or corresponding logit models (such as
adjacent-categories or baseline-categories logit models) or they can be other
types of multinomial response models, such as cumulative or continuation-ratio
logit models (Agresti, 1990). The second form for each model in (3.3.1) allows
for linear probability or mean response models (Grizzle et al., 1969). All of the
models in J ∩ M can be fit using the methods of Chapter 2. We illustrate the
usefulness of these models by way of example.

3.4     Numerical  Examples 

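Example 1. We begin by simultaneously modeling the joint and marginal
distributions for Table 3.1. Recall that response variable $V^{(1)}$ represents a
randomly chosen subject's response to the political interest question in 1956
and $V^{(2)}$ is a randomly chosen subject's response to the political interest
question in 1960. Some candidate models for the joint distribution of
$(V^{(1)}, V^{(2)})$ include the following:

\[
\begin{aligned}
J(I) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} \\
J(QSY) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_{ij}, \quad \alpha_{ij} = \alpha_{ji} \\
J(L \times L) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \theta u_i v_j \\
J(L \times L + D) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \theta u_i v_j + \delta I(i = j) \\
J(S) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_{ij}^{V^{(1)} V^{(2)}}
\end{aligned}
\]
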

where I = independence, QSY = quasi symmetry, L × L = linear-by-linear
association, and L × L + D also adds a main-diagonal parameter. The
latter two models recognize the ordinality of the measurement scale, through
sets of monotone scores $\{u_i\}$ for $V^{(1)}$ and $\{v_j\}$ for $V^{(2)}$. The L × L form of
model fits well when underlying continuous variables have a bivariate normal
distribution (Goodman, 1981; Becker, 1989), and extra parameters for the
main diagonal can account for larger frequencies often observed there when
both dimensions have the same categories.

Candidate models for the marginal distributions of $V = (V^{(1)}, V^{(2)})$
include the following:

\[
\begin{aligned}
M(MH) &: \quad \log m_i(t) = \beta + \beta_i^{R} + \beta_t^{V} \\
M(L \times L) &: \quad \log m_i(t) = \beta + \beta_i^{R} + \beta_t^{V} + \beta_t^{RV} u_i \\
M(CU) &: \quad \mathrm{logit}\, \eta_i(t) = \omega_i + \gamma_t \\
M(S) &: \quad \log m_i(t) = \beta + \beta_i^{R} + \beta_t^{V} + \beta_{it}^{RV}
\end{aligned}
\]

where CU denotes the cumulative logit and the superscript R is used to
label those parameters related to the 'level' of response. There is marginal
homogeneity if there is no association between level of response (R) and
response variable (V) (cf. Agresti, 1989). When the number of levels of
V exceeds two (i.e., T > 2) and V can be considered ordinal, rather than
assume that there are general row effects for levels of V, one could account
for the ordinality by introducing scores for the levels of V. That is, we could
replace $\beta_t^{RV} u_i$ by $\beta^{RV} u_i v_t$ in the loglinear model and replace $\gamma_t$ by $\gamma v_t$ in the
cumulative logit model. An example where we can consider V as ordinal is
when the T responses represent repeated measures over time. The T levels
of V are then naturally ordered: response at occasion 1 ($V^{(1)}$), response at
occasion 2 ($V^{(2)}$), ..., response at occasion T ($V^{(T)}$). For model identifiability,
certain parameters (or more generally, linear combinations of parameters)
were set to zero. For example, the parameter $\gamma_2$ of model M(CU) was set to
zero.


To obtain information about which simultaneous models may fit well,
we first investigate joint and marginal models separately. Table 3.3 contains
likelihood-ratio ($G^2$) and Pearson ($X^2$) goodness-of-fit statistics for several
models in the class J(U) ∩ M(S). The associated distance or residual degrees
of freedom are listed as well. The linear-by-linear terms used equally spaced
scores for rows and for columns.

Table 3.3. Joint Distribution Models: Goodness of Fit

Model                    df       G^2       X^2
J(S) ∩ M(S)               0      0.00      0.00
J(QSY) ∩ M(S)             1      0.39      0.39
J(L×L+D) ∩ M(S)           2      0.49      0.49
J(L×L) ∩ M(S)             3     18.58     18.72
J(I) ∩ M(S)               4    245.01    253.09

Both J(QSY) and the simpler J(L×L+D) models fit well. Notice
that the independence model fits poorly, as is usually the case for longitudinal
data.

We next fit several models in the class J(S) ∩ M(U). The goodness-of-fit
statistics and the associated residual degrees of freedom for these marginal
models are given in Table 3.4.

Table 3.4. Marginal Distribution Models: Goodness of Fit

Model                    df       G^2       X^2
J(S) ∩ M(CU)              1      3.35      3.35
J(S) ∩ M(L×L)             1      4.21      4.20
J(S) ∩ M(MH)              2     38.22     37.49

There is very strong evidence of marginal heterogeneity as measured by the
goodness-of-fit statistic for the model J(S) ∩ M(MH) or as measured by a
comparison of that fit with the fit of some unsaturated model that allows for
marginal heterogeneity.

Finally, we will try to find a good-fitting, parsimonious model in the
class J(U) ∩ M(U) that simultaneously describes the joint and marginal
distributions. Since the model J(L×L+D) fits the data very well, we
will assume this structure for the joint distribution and simultaneously fit
several candidate marginal models. In section 3.5, we show that the model
J(L×L+D) belongs to a class of joint distribution models that do not imply
any structure on the marginal distribution. It therefore follows that residual
degrees of freedom for the simultaneous model J(L×L+D) ∩ M(U) can be
computed as follows:

\[
\mathrm{df}_{res}[J(L \times L + D) \cap M(U)] = \mathrm{df}_{res}[J(L \times L + D)] + \mathrm{df}_{res}[M(U)].
\]

This follows since the model is well defined in the sense of Chapter 2 and since,
for well-defined models, residual degrees of freedom is simply the difference
between the number of constraints implied by the simpler model and the
number of constraints implied by the less structured model. Table 3.5 contains
the result of fitting several models in the class J(L×L+D) ∩ M(U).


Table 3.5. Candidate Models in J(L×L+D) ∩ M(U): Goodness of Fit

Model                        df       G^2       X^2
J(L×L+D) ∩ M(S)               2      0.49      0.49
J(L×L+D) ∩ M(CU)              3      3.84      3.82
J(L×L+D) ∩ M(L×L)             3      4.68      4.66
J(L×L+D) ∩ M(MH)              4     38.73     38.15
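
The additivity of residual degrees of freedom can be checked against the reported tables. The following Python sketch is only a bookkeeping check, not part of the fitting method: the dictionaries transcribe the df columns of Tables 3.3, 3.4, and 3.5, and the names `df_joint`, `df_marginal`, `df_simultaneous`, and `additive_df` are illustrative.

```python
# Residual df for the joint model, transcribed from Table 3.3.
df_joint = {"LxL+D": 2}

# Residual df for the marginal models, transcribed from Table 3.4.
# The saturated marginal model M(S) has 0 residual df by definition.
df_marginal = {"S": 0, "CU": 1, "LxL": 1, "MH": 2}

# Residual df reported in Table 3.5 for J(LxL+D) ∩ M(.).
df_simultaneous = {"S": 2, "CU": 3, "LxL": 3, "MH": 4}


def additive_df(joint, marginal):
    """Residual df of a well-defined simultaneous model as the sum of its parts."""
    return df_joint[joint] + df_marginal[marginal]


# Compare the additive rule with every row of Table 3.5.
checks = {m: additive_df("LxL+D", m) == df_simultaneous[m] for m in df_simultaneous}
print(checks)
```

Each entry of `checks` is true for the tabled values, which is what the additivity formula predicts when the joint model places no structure on the margins.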

The simple model J(L×L+D) ∩ M(CU) fits the data very well
($G^2 = 3.84$, df = 3). This model implies that the joint and marginal
distributions simultaneously follow the models

\[
\begin{aligned}
J(L \times L + D) &: \quad \log \mu_{ij} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \theta u_i v_j + \delta I(i = j) \\
M(CU) &: \quad \mathrm{logit}\, \eta_i(t) = \omega_i + \gamma_t
\end{aligned}
\]

In Table 3.6, we give the ML estimates of the freedom parameters for
this model along with their estimated standard errors.

Table 3.6. Estimates of Freedom Parameters for
Model J(L×L+D) ∩ M(CU)

Parameter                 Estimate    Std. Error
$\alpha$                     0.085         0.662
$\alpha_1^{V^{(1)}}$         2.430         0.349
$\alpha_2^{V^{(1)}}$         1.605         0.203
$\alpha_1^{V^{(2)}}$         1.606         0.325
$\alpha_2^{V^{(2)}}$         1.172         0.192
$\theta$                     0.563         0.081
$\delta$                     0.355         0.084
$\omega_1$                  -1.255         0.063
$\omega_2$                   0.435         0.057
$\gamma_1$                   0.341         0.058

To test for marginal homogeneity in the context of this model, we can use
either of two asymptotically equivalent $\chi^2(1)$ test statistics:

\[
G^2 = 38.73 - 3.84 = 34.89
\]
\[
W^2 = \left( \frac{0.341}{0.058} \right)^2 \approx 34.6
\]

where $W^2$ is the squared Wald statistic. The P-values for both of these tests
are less than 0.001. We conclude that there is strong evidence of marginal
heterogeneity. We need not, and should not, stop here. Since we are working
with model and freedom parameters, we can continue with other model-based
inferences. Interval estimation of certain interesting freedom parameters is
considered next.
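
The two test statistics and their tail probabilities can be reproduced from the tabled values alone. The sketch below is a minimal Python illustration that uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$ rather than a statistics library; the helper name `chi2_sf_1df` is hypothetical, and the inputs are the rounded figures from Tables 3.5 and 3.6, so the Wald value is only approximate.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variate with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


# Likelihood-ratio statistic: difference in G^2 between the MH and CU marginal
# models, both fitted with the J(LxL+D) joint structure (Table 3.5).
g2 = 38.73 - 3.84

# Squared Wald statistic for gamma_1 (estimate and standard error from Table 3.6).
w2 = (0.341 / 0.058) ** 2

for name, stat in (("G^2", g2), ("W^2", w2)):
    print(f"{name} = {stat:.2f}, P-value = {chi2_sf_1df(stat):.2e}")
```

Both tail probabilities fall far below 0.001, in line with the conclusion stated in the text.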

The interpretation of the parameter $\gamma_1$ is as follows: The odds that a
randomly selected subject would have responded at level i or less in 1956 are
$\exp(\gamma_1)$ times higher than the odds that a randomly selected subject would
have responded at level i or less in 1960. Thus, the freedom parameter $\gamma_1$
measures the departure from marginal homogeneity in that the two odds are
identical if and only if $\gamma_1 = 0$. We use the delta method to compute a 95%
confidence interval for the odds ratio $\exp(\gamma_1)$; it is [1.324, 1.488]. Thus,
based on the data at hand, we estimate that the odds that a subject would
respond at level i or less in 1956 are between 1.324 and 1.488 times higher
than the odds that a subject would respond at level i or less in 1960. There
is significant evidence of increased political interest in 1960 relative to 1956.
Next we consider the association between the two responses. The estimated
odds that the response in 1960 was 'very much' instead of 'somewhat' are
$\exp(\hat\theta + 2\hat\delta) = 3.57$ times higher when the response in 1956 was 'very much'
than when it was 'somewhat'. The same estimated odds ratio applies when
the response was 'somewhat' instead of 'not much'. Similarly, the estimated
odds that the response in 1960 was 'very much' instead of 'not much' are
$\exp(4\hat\theta + 2\hat\delta) = 19.34$ times higher when the response in 1956 was 'very much'
than when it was 'not much'. In summary, there is evidence of strong positive
association between the response in 1956 and the response in 1960 and there
is evidence that there was greater political interest in 1960 than in 1956.
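
The two association odds ratios quoted above follow from the form of the linear-by-linear plus diagonal model: in a log odds ratio formed from cells (i, j), (i2, j2), (i, j2), (i2, j) the main effects cancel, leaving a score term and a diagonal term. The Python sketch below illustrates this under the assumption of unit-spaced scores and an arbitrary labeling of the three response levels as 1, 2, 3; the function name `log_odds_ratio` is hypothetical.

```python
from math import exp

# Estimates of theta and delta taken from Table 3.6.
theta_hat, delta_hat = 0.563, 0.355

# Assumed equally spaced scores for the three response levels.
scores = {1: 1, 2: 2, 3: 3}


def log_odds_ratio(i, i2, j, j2, theta, delta, u=scores, v=scores):
    """log[(mu_ij * mu_i2j2) / (mu_ij2 * mu_i2j)] under the L x L + D model."""
    # Linear-by-linear contribution: theta * (u_i - u_i2) * (v_j - v_j2).
    linear = theta * (u[i] - u[i2]) * (v[j] - v[j2])
    # Diagonal contribution: +delta for each diagonal cell in the numerator,
    # -delta for each diagonal cell in the denominator.
    diagonal = delta * ((i == j) + (i2 == j2) - (i == j2) - (i2 == j))
    return linear + diagonal


adjacent = exp(log_odds_ratio(1, 2, 1, 2, theta_hat, delta_hat))  # theta + 2*delta
extreme = exp(log_odds_ratio(1, 3, 1, 3, theta_hat, delta_hat))   # 4*theta + 2*delta
print(round(adjacent, 2), round(extreme, 2))
```

With the tabled estimates the two values round to 3.57 and 19.34, matching the figures in the text.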

Suppose we ignored the fact that the same subjects responded to the
political interest question in 1956 and 1960. If we treated the two responses as
independent, then the row and column marginal counts would be distributed
as independent multinomials with the same index N = 1203 and probability
vectors $\{\phi_i(1)\}$ and $\{\phi_j(2)\}$. Then it follows that separately fitting the
marginal model M(U) under this independence assumption is equivalent
to fitting the simultaneous model J(I) ∩ M(U). By results of Liang and
Zeger (1986), the estimates of parameters in M(U) would be consistent, even
when the responses are not truly independent. However, the estimates of
the corresponding standard errors would no longer be valid. One way to see
that we are losing information by incorrectly assuming independence is by
comparing the likelihood-ratio statistic for testing MH assuming J(I) holds
to the likelihood-ratio statistic for testing MH assuming that J(L×L+D)
holds. The former is $G^2 = 268.33 - 247.74 = 20.59$ and the latter is $G^2 = 34.89$.
Both of these values would be compared to a tabled $\chi^2(1)$ value. Evidently,
by accounting for the dependence between the responses we have greater
evidence of marginal heterogeneity. Another way of illustrating the effect
of wrongly assuming independence between the responses is by looking at
the freedom parameter estimates and their estimated standard errors for
different models. Table 3.7 contains estimates of $\gamma_1$ and the corresponding
standard error estimate under three different models of interest. Notice that
the standard errors are similar when one uses either the saturated or the
diagonal parameter model for the joint distribution.


Table 3.7. Freedom Parameter Estimates and Standard Errors

Model                        df    $\hat\gamma_1$    se($\hat\gamma_1$)
J(S) ∩ M(CU)                  1       0.342       0.058
J(L×L+D) ∩ M(CU)              3       0.341       0.058
J(I) ∩ M(CU)                  5       0.343       0.076
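
The effect of the inflated standard error under independence can be seen by turning each row of Table 3.7 into a squared Wald statistic and setting it beside the corresponding likelihood-ratio statistic. The Python sketch below does only this arithmetic on the reported (rounded) values; the dictionary keys and variable names are illustrative.

```python
# (estimate, standard error) of gamma_1 under each joint model, from Table 3.7.
rows = {
    "J(S)": (0.342, 0.058),
    "J(LxL+D)": (0.341, 0.058),
    "J(I)": (0.343, 0.076),
}

# Squared Wald statistic for marginal homogeneity under each joint model.
wald = {name: (est / se) ** 2 for name, (est, se) in rows.items()}

# Likelihood-ratio statistics for MH quoted in the text.
lr_independence = 268.33 - 247.74   # assuming J(I)
lr_lxl_d = 38.73 - 3.84             # assuming J(LxL+D)

for name, w2 in wald.items():
    print(f"{name:10s} W^2 = {w2:6.2f}")
print(f"LR under J(I) = {lr_independence:.2f}, LR under J(LxL+D) = {lr_lxl_d:.2f}")
```

The point estimates barely move across the three models, while the statistic under J(I) is markedly smaller than under the other two, which is the loss of information described above.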


We have shown that there may be problems with assuming too much
structure on the joint distribution; for example, unreasonably assuming
independence. Similarly, we should be concerned with assuming too little
structure on the joint distribution. In this case, too many freedom parameters
require estimation and the overall fit may be unreliable. A good model is one
that fits the data at hand relatively well and is robust to the white noise
present in the data generation. That is, a good-fitting model with model
parameter estimates that change very little for different realizations of the
random data vector is considered a good model. For example, the saturated
model fits perfectly but has parameter estimates that may change greatly
for different realizations. In this sense the saturated model may not be a
good one; it may be unreliable. When we ignore the association structure by
separately fitting marginal models, we are tacitly using the saturated model
for the joint distribution. Table 3.8 illustrates why we should search for a
good-fitting, parsimonious model. Note that the standard errors of expected
cell frequency estimates are inflated when we assume a saturated model for
the joint distribution. The more parsimonious model J(L×L+D) ∩ M(CU)
fits as well as the less structured model J(S) ∩ M(CU), yet it is more reliable
in the sense described above.


Table 3.8. Estimated Cell Means and Standard Errors
for Models J(S) ∩ M(CU) and J(L×L+D) ∩ M(CU)

    J(S) ∩ M(CU)              J(L×L+D) ∩ M(CU)
    Mean      Std. Error      Mean      Std. Error
    152.79       11.49        154.28       10.56
    127.00        8.80        123.08        6.56
     64.87        7.82         66.98        7.16
     82.74        7.53         83.25        4.89
    237.30       13.80        237.30       13.80
    159.53        9.95        159.00        8.12
     31.41        5.52         29.37        4.10
     99.14        8.44        103.05        6.28
    248.23       13.98        246.70       13.16

Example 2. We continue with the cross-over data example of section 3.2.
Denote the set of 18 local odds ratios by $\{\tau_{ijk}\}$, where

\[
\tau_{ijk} = \frac{\pi_{ijk}\, \pi_{i+1,j+1,k}}{\pi_{i+1,jk}\, \pi_{i,j+1,k}}
\]

and $\pi_{ijk}$ represents the probability that a randomly chosen subject from
Group (G) k responds at the ith level for device A ($V^{(1)}$) and the jth level for
device B ($V^{(2)}$). Recall that cumulative marginal probabilities are denoted
by

\[
\eta_i(t; k) =
\begin{cases}
\sum_{j=1}^{i} \pi_{j+k}, & \text{if } t = 1 \text{ (device A)} \\
\sum_{j=1}^{i} \pi_{+jk}, & \text{if } t = 2 \text{ (device B)}
\end{cases}
\]

where i = 1, 2, 3, 4 and k = 1, 2. To elucidate, $\eta_3(2; 1)$ represents the
probability that a randomly chosen subject from Group 1 will respond at
level 3 or lower for device B ($V^{(2)}$).
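
To make the two definitions concrete, the Python sketch below computes local odds ratios and cumulative marginal probabilities for a single group. The 4 × 4 table of counts is invented purely for illustration and is not the cross-over data of Table 3.2; the function names are likewise hypothetical.

```python
def local_odds_ratios(p):
    """Local odds ratios p[i][j] * p[i+1][j+1] / (p[i+1][j] * p[i][j+1])."""
    n_rows, n_cols = len(p), len(p[0])
    return [[p[i][j] * p[i + 1][j + 1] / (p[i + 1][j] * p[i][j + 1])
             for j in range(n_cols - 1)]
            for i in range(n_rows - 1)]


def cumulative_marginals(p):
    """Cumulative row (device A) and column (device B) marginal probabilities."""
    def cumulate(values):
        out, running = [], 0.0
        for v in values:
            running += v
            out.append(running)
        return out

    row_totals = [sum(row) for row in p]          # pi_{i+}
    col_totals = [sum(col) for col in zip(*p)]    # pi_{+j}
    return cumulate(row_totals), cumulate(col_totals)


# Hypothetical counts for one sequence group (illustration only).
counts = [[8, 4, 2, 1],
          [4, 8, 4, 2],
          [2, 4, 8, 4],
          [1, 2, 4, 8]]
total = sum(map(sum, counts))
p = [[c / total for c in row] for row in counts]

tau = local_odds_ratios(p)
eta_a, eta_b = cumulative_marginals(p)
print(len(tau) * len(tau[0]), "local odds ratios per group")
print([round(x, 3) for x in eta_a])
```

A 4 × 4 table yields 3 × 3 = 9 local odds ratios, so two sequence groups give the 18 ratios mentioned in the text.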

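Some possible models for the joint distributions of $(V_k^{(1)}, V_k^{(2)})'$, k = 1, 2,
include the following:

\[
\begin{aligned}
J(S) &: \quad \log \mu_{ijk} = \alpha_{ijk} \\
J(V^{(1)}G, V^{(2)}G, V^{(1)}V^{(2)}) &: \quad \log \mu_{ijk} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_k^{G} + \alpha_{ik}^{V^{(1)}G} + \alpha_{jk}^{V^{(2)}G} + \alpha_{ij}^{V^{(1)}V^{(2)}} \\
J(L \times L) &: \quad \log \mu_{ijk} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_k^{G} + \alpha_{ik}^{V^{(1)}G} + \alpha_{jk}^{V^{(2)}G} + \theta u_i v_j \\
J(V^{(1)}G, V^{(2)}G) &: \quad \log \mu_{ijk} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_k^{G} + \alpha_{ik}^{V^{(1)}G} + \alpha_{jk}^{V^{(2)}G} \\
J(V^{(1)}, V^{(2)}, G) &: \quad \log \mu_{ijk} = \alpha + \alpha_i^{V^{(1)}} + \alpha_j^{V^{(2)}} + \alpha_k^{G} \\
J(UA(G)) &: \quad \log \tau_{ijk} = \omega_k \\
J(UA) &: \quad \log \tau_{ijk} = \omega
\end{aligned}
\]
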

where J(S) is a fully saturated model, J(V(1)G, V(2)G, V(1)V(2)) assumes
no three-factor interaction, J(L×L) implies that there is no three-factor
interaction and that the association between the ordinal responses can
be accounted for by including a linear-by-linear association parameter,
J(V(1), V(2), G) is the mutual independence model, and J(V(1)G, V(2)G)
implies that $V^{(1)}$ and $V^{(2)}$ are conditionally independent given G. The model
J(UA(G)) implies uniform association within levels of G, and J(UA) is the
simple model that assumes this uniform association is the same for both levels
of G. When the row and column scores $\{u_i\}$ and $\{v_j\}$ are equally spaced,
models J(L×L) and J(UA) are equivalent. It is shown in section 3.6 that
model J(V(1), V(2), G) implies that the marginal distributions of $(V^{(1)}, V^{(2)})'$
do not depend on G. When this happens, the simultaneous model will be ill
defined whenever the marginal model constrains the marginal distributions to
be equal across levels of G. We will not consider this particular model for this
reason. The rest of the models do not imply any structure on the marginal
distributions. Also, notice that simultaneously fitting J(V(1)G, V(2)G) and
some marginal model M(U) is equivalent to separately fitting M(U) when
the row and column marginal counts are treated as independent multinomials
within each level of G.

The marginal models we fitted include the following cumulative logit
models:

\[
\begin{aligned}
M(S) &: \quad \mathrm{logit}(\eta_i(t; k)) = \beta_{itk} \\
M(VG) &: \quad \mathrm{logit}(\eta_i(t; k)) = \beta_i + \beta_t^{V} + \beta_k^{G} + \beta_{tk}^{VG} \\
M(V, G) &: \quad \mathrm{logit}(\eta_i(t; k)) = \beta_i + \beta_t^{V} + \beta_k^{G} \\
M(V) &: \quad \mathrm{logit}(\eta_i(t; k)) = \beta_i + \beta_t^{V} \\
M(1) &: \quad \mathrm{logit}(\eta_i(t; k)) = \beta_i
\end{aligned}
\]

where M(S) is the saturated model and M(VG) is the proportional-odds
cumulative-logit model for the marginal probabilities that allows for otherwise
general association between the response variable V, the group or population
variable G, and the response 'level' R. In the literature on cross-over designs,
a second-order interaction among V, G, and R is said to be a 'carryover' effect.
The model M(V, G) implies that there is no second-order interaction among
the variables V, G, and R, i.e., the model implies that there is no carryover
effect. The model M(V) implies that there is no G effect, i.e., no sequence
effect. Finally, the simple model M(1) implies that there is no V or G effect.
To make these models identifiable, we place the following restrictions on the
freedom parameters.

.vG^r^^^  if 


t  +  k  =  3 
otherwise 


With this parameterization, $\beta^V$, $\beta^G$, and $\beta^{VG}$ measure device, sequence, and
carryover effects, respectively.

Table 3.9 displays the goodness-of-fit statistics and their associated
degrees of freedom for several simultaneous models. The L × L model used
the equally spaced row and column scores $\{u_i = i\}$ and $\{v_j = j\}$.

Table 3.9. Cross-over Data Models: Goodness of Fit

Model                                    df       G^2       X^2
J(S) ∩ M(S)                               0      0         0
J(S) ∩ M(VG)                              6     10.55      6.91
J(UA) ∩ M(S)                             17     17.36      9.19
J(V(1)G, V(2)G, V(1)V(2)) ∩ M(VG)        15     14.28     10.65
J(L×L) ∩ M(VG)                           23     28.52     27.00
J(V(1)G, V(2)G) ∩ M(VG)                  24     37.92     58.77
J(UA(G)) ∩ M(VG)                         22     28.45     26.11
J(UA) ∩ M(VG)                            23     28.52     27.00
J(UA) ∩ M(V, G)                          24     29.97     29.64
J(UA) ∩ M(V)                             25     31.05     30.32
J(UA) ∩ M(1)                             26     70.51     64.87

Evidently the parsimonious model J(UA) ∩ M(V) fits the data very well.
This model implies that there is no period or carryover effect and that the
uniform association structure is the same for each sequence group. There is
evidence of a significant device effect ($G^2 = 70.51 - 31.05 = 39.46$, df = 1).
We will proceed to describe this device effect. The freedom parameter ML
estimates and their corresponding standard error estimates are given in Table
3.10.


Table 3.10. Freedom Parameter ML Estimates
for Model J(UA) ∩ M(V)

Parameter        Estimate    Std. Error
$\omega$            0.469         0.148
$\beta_1$           0.542         0.096
$\beta_2$           3.189         0.219
$\beta_3$           4.360         0.375
$\beta^V$           0.511         0.082

These estimates also indicate that there is a significant device effect; the
Wald statistic, which is based on 1 degree of freedom, takes on the value
$W^2 = (\hat\beta^V / \mathrm{se}(\hat\beta^V))^2 = 38.8$. The magnitude of the device effect can be estimated
using $\hat\beta^V$. Specifically, the odds of responding i + 1 or higher for device B are
estimated to be $e^{2\hat\beta^V} = 2.78$ times higher than the odds for device A. Using
the delta method, an approximate 95% confidence interval for this odds ratio
is (1.87, 3.69). Since the higher responses correspond to less perceived clarity
of the instructional leaflet, we conclude that there is evidence suggesting a
significant improvement of device A over device B in terms of perceived clarity
of instructions. We can describe the association between the two responses
using $\hat\omega$. For either sequence group, the odds of responding at level i instead
of i + 1 for device A are estimated to be exp(0.469) = 1.6 times higher when the
response for device B was j rather than j + 1. This holds for each i and j. In
summary, there is a moderate positive association between the two responses,
the strength of association being the same for both sequence groups. There
also is significant evidence of increased perceived clarity for device A over
device B.
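
The delta-method interval for the device odds ratio can be sketched as follows. For a transformation exp(c·β) the approximate standard error is |c|·exp(c·β̂)·se(β̂). The Python code below applies this to the estimates in Table 3.10; the helper `delta_ci_exp` is hypothetical, and the multiplier z = 2 is an assumption chosen because it reproduces the reported interval to two decimals (z = 1.96 gives roughly (1.89, 3.67)).

```python
from math import exp


def delta_ci_exp(estimate, std_error, scale=1.0, z=2.0):
    """Point estimate and delta-method interval for exp(scale * estimate)."""
    point = exp(scale * estimate)
    # Derivative of exp(scale * b) with respect to b is scale * exp(scale * b).
    se_point = abs(scale) * point * std_error
    return point, point - z * se_point, point + z * se_point


# Device effect: odds ratio exp(2 * beta_V), with beta_V from Table 3.10.
device_or, device_lo, device_hi = delta_ci_exp(0.511, 0.082, scale=2.0)

# Uniform association: local odds ratio exp(omega), with omega from Table 3.10.
assoc_or = exp(0.469)

print(f"device odds ratio {device_or:.2f} ({device_lo:.2f}, {device_hi:.2f})")
print(f"uniform local odds ratio {assoc_or:.1f}")
```

Because the interval is symmetric on the odds-ratio scale, it differs slightly from the interval obtained by exponentiating the endpoints of an interval for the log odds ratio; the symmetric form is the one that matches the reported values.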


3.5. Product-Multinomial Versus Product-Poisson Estimators: An Application

In this section and in section 3.6, we explore some of the more practical
aspects of model fitting for categorical data. In this section we will illustrate,
by way of example, how to determine when inferences based on freedom
parameters will be the same under both sampling assumptions, namely
product-multinomial and product-Poisson. The method of determination is a
direct consequence of Theorem 2.4.2. In section 3.6, we address, at least
partially, the issue of whether or not the model is well defined. Closely related
to this is the computation of residual and model degrees of freedom.

Consider the data taken from the Harvard Study of Air Pollution and
Health. The data, displayed in Table 3.11, can be found in Agresti (1990,
p. 414); they were supplied by Dr. James Ware.


Table 3.11. Children's Respiratory Illness Data

                                 No Maternal Smoking    Maternal Smoking
Child's Respiratory Illness            Age 10                Age 10
Age 7    Age 8    Age 9              No      Yes           No      Yes
No       No       No                237       10          118        6
No       No       Yes                15        4            8        2
No       Yes      No                 16        2           11        1
No       Yes      Yes                 7        3            6        4
Yes      No       No                 24        3            7        3
Yes      No       Yes                 3        2            3        1
Yes      Yes      No                  6        2            4        2
Yes      Yes      Yes                 5       11            4        7

Source: Agresti (1990, p. 414), supplied by Dr. James Ware


The two groups of children, those with smoking mothers and those with
nonsmoking mothers, were followed for four years, from age 7 to age 10. At
each occasion, each child was tested for respiratory illness. The response
vector for the kth (k = 1, 2) group of children is $V_k = (V^{(1)}, V^{(2)}, V^{(3)}, V^{(4)})$,
where each response $V^{(t)}$ is binary; either the disease is present or it is not. Our
goal is to find a parsimonious, simultaneous model that fits the data well.
Using this model, we will be able to address questions such as "is mother's
smoking status associated with the child's respiratory illness status?" or "are
the odds of having respiratory illness the same for all four years?"

After fitting several simultaneous models, we finally settled on the
following good-fitting ($G^2 = 14.33$, df = 22) simultaneous model:

\[
J: \quad \log \mu_{i_1 i_2 i_3 i_4 s} = \alpha + \alpha_{i_1}^{V^{(1)}} + \alpha_{i_2}^{V^{(2)}} + \alpha_{i_3}^{V^{(3)}} + \alpha_{i_4}^{V^{(4)}} + \alpha_s^{S} + \alpha_{i_1 i_2}^{V^{(1)}V^{(2)}} + \cdots + \alpha_{i_4 s}^{V^{(4)}S}
\tag{3.5.1}
\]
\[
M: \quad \mathrm{logit}\, \phi_i(t; s) = \theta + \theta_t^{V},
\tag{3.5.2}
\]

where $\theta_t^{V}$ satisfies the following.

This model ((3.5.1) ∩ (3.5.2)) implies that there are no three-factor
interactions among the five factors (the four responses and the covariate),
that there is no significant group (Smoker) effect, and that there is marginal
homogeneity among the first three times. There is an indication that the
odds of having respiratory illness are lower when the child is 10 years old. In
fact, the test statistic value used for testing marginal homogeneity across all
four times was significantly large ($G^2 = 24.29 - 14.33 = 9.96$, df = 1).
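
The pattern described here is already visible in the raw marginal counts of Table 3.11. The Python sketch below forms, for each group, the eight marginal totals (no illness and illness at each age) by multiplying the 16 cell counts by a 0/1 indicator matrix, and then takes one logit contrast per occasion. It is an empirical illustration only, not the ML fit; it assumes the cells are ordered lexicographically with age 10 varying fastest and "No" as the first level, and the names `A2_block`, `C2_block`, and `marginal_logits` are chosen for this example.

```python
from itertools import product
from math import log

# Cell counts from Table 3.11, row by row (age 10 varies fastest; No before Yes).
counts = {
    "nonsmoking": [237, 10, 15, 4, 16, 2, 7, 3, 24, 3, 3, 2, 6, 2, 5, 11],
    "smoking": [118, 6, 8, 2, 11, 1, 6, 4, 7, 3, 3, 1, 4, 2, 4, 7],
}

# The 16 response patterns in the same order; 0 = no illness, 1 = illness.
cells = list(product((0, 1), repeat=4))

# 8 x 16 indicator matrix: one row per (occasion, level) marginal total.
A2_block = [[1 if cell[t] == level else 0 for cell in cells]
            for t in range(4) for level in (0, 1)]

# 4 x 8 contrast matrix: log(no illness) - log(illness) at each occasion.
C2_block = [[1 if c == 2 * t else -1 if c == 2 * t + 1 else 0 for c in range(8)]
            for t in range(4)]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def marginal_logits(y):
    """Marginal totals and per-occasion logits of 'no illness' for one group."""
    margins = matvec(A2_block, y)
    return margins, matvec(C2_block, [log(m) for m in margins])


for group, y in counts.items():
    margins, logits = marginal_logits(y)
    ill = margins[1::2]  # illness counts at ages 7, 8, 9, 10
    print(group, ill, [round(v, 2) for v in logits])
```

In both groups the illness count is smallest at age 10, so the observed logit of "no illness" is largest there, which agrees with the direction of the fitted effect.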

Our  objective  in  this  section  is  to  determine  which  of  the  freedom 
parameter  estimates,  if  any,  are  affected  by  assuming  the  counts  are  product- 
Poisson  rather  than  product-multinomial.  We  will  use  Theorem  2.4.2.  To 
invoke  the  results  of  that  theorem  more  directly,  we  will  rewrite  the  model 
using  the  matrix  notation  of  Chapter  2.  The  model  can  be  written  as 

$$C \log A\mu = X\beta,$$

where

$$C = C_1 \oplus C_2, \qquad A = (A_1', A_2')', \qquad X = X_1 \oplus X_2, \qquad \beta = \mathrm{vec}(\beta_1, \beta_2),$$

and

$$C_1 = \oplus_{1}^{2} I_{16} = I_{32},$$

$$C_2 = \oplus_{1}^{2}
\begin{pmatrix}
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{pmatrix},$$

$$A_1 = \oplus_{1}^{2} I_{16} = I_{32},$$

$$A_2 = \oplus_{1}^{2}
\begin{pmatrix}
1&1&1&1&1&1&1&1&0&0&0&0&0&0&0&0 \\
0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1 \\
1&1&1&1&0&0&0&0&1&1&1&1&0&0&0&0 \\
0&0&0&0&1&1&1&1&0&0&0&0&1&1&1&1 \\
1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0 \\
0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1 \\
1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0 \\
0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1
\end{pmatrix},$$

[Design matrices $X_1$ ($32 \times 16$) and $X_2$ ($8 \times 2$): entries not legible in this copy.]

$$\beta_1 = (\alpha,\ \alpha_1^{V(1)},\ \alpha_1^{V(2)},\ \alpha_1^{V(3)},\ \alpha_1^{V(4)},\ \alpha_1^{S},\ \alpha_{11}^{V(1)S},\ \alpha_{11}^{V(2)S},\ \alpha_{11}^{V(3)S},\ \alpha_{11}^{V(4)S},\ \alpha_{11}^{V(1)V(2)},\ \alpha_{11}^{V(1)V(3)},\ \alpha_{11}^{V(1)V(4)},\ \alpha_{11}^{V(2)V(3)},\ \alpha_{11}^{V(2)V(4)},\ \alpha_{11}^{V(3)V(4)})',$$

and

$$\beta_2 = (\theta,\ \theta_1^{V})'.$$

Also, the vector of expected cell counts $\mu$ is a $2 \cdot 2^4 \times 1$ vector and is
defined as

$$\mu = (\mu_{11111}, \mu_{11121}, \ldots, \mu_{22221}, \mu_{11112}, \ldots, \mu_{22222})'.$$

That is, the last subscript (corresponding to the $s$th group) is changing the
slowest and the other 4 subscripts are in lexicographical order.
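The cell ordering just described determines the within-group block of the marginalizing matrix $A_2$. As a hedged illustration (the function name `marginal_block` is hypothetical, and the construction simply assumes that each row of the block sums the cells in which one response takes one level), the eight 0/1 rows listed for $A_2$ can be regenerated from the lexicographic ordering of the four binary responses:

```python
from itertools import product


def marginal_block(n_responses=4, n_levels=2):
    """Build the within-group block that maps the 16 cell means
    (lexicographic order, last response changing fastest) to the
    8 marginal sums: one row per (occasion, response level)."""
    cells = list(product(range(1, n_levels + 1), repeat=n_responses))
    rows = []
    for t in range(n_responses):
        for level in range(1, n_levels + 1):
            rows.append([1 if cell[t] == level else 0 for cell in cells])
    return rows


block = marginal_block()
as_text = ["".join(str(entry) for entry in row) for row in block]
for line in as_text:
    print(line)
```

The full $A_2$ is then the direct sum of two copies of this block, one per group, because the group subscript changes slowest.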

In view of Theorem 2.4.2, we must determine, for $i = 1, 2$, whether or
not $C_i$ is a contrast matrix. If it is not, then we must find those columns of $X_i$
that span a set containing the range space of $\oplus_1^2 1_{m_i}$, where $q_1 = m_1 = 16$ and
$q_2 = 4 \neq m_2$. Recall that $q_i$ is the number of response functions within each
independent population for the $i$th model. For example, for this data set, the
second model ($i = 2$), which is the marginal model (3.5.2), has $q_2 = 4$ logits to
be modeled within each of the two population groups (children with smoking
mothers and children with nonsmoking mothers). As in the statement of the
theorem, we will find a minimal spanning subset.

Since matrix $C_1$ is not a contrast matrix, we wish to find the columns
of $X_1$ that span a space containing the range space of $\oplus_1^2 1_{16}$. With the
parameterization we have used, we can easily see that the first and the
sixth columns of $X_1$ span the required space. Also, $C_2$ is a contrast matrix.
Therefore, it follows by Theorem 2.4.2 that the two asymptotic variances of
the freedom parameter estimators, computed under the two different sampling
assumptions, are related as follows.

$$\mathrm{var}(\hat\beta^{(M)}) = \mathrm{var}(\hat\beta^{(P)}) - \begin{pmatrix} \Lambda^{11} & \Lambda^{12} \\ \Lambda^{21} & \Lambda^{22} \end{pmatrix},$$

where $\Lambda^{11}$ is a $16 \times 16$ matrix with zeroes everywhere except in rows 1 and 6
and columns 1 and 6 and all the other $\Lambda^{ij}$'s are zero matrices.
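The two checks used in this argument are easy to mechanize. The sketch below rests on two labeled assumptions that are not spelled out in this excerpt: that a contrast matrix is one whose rows each sum to zero, and that the first and sixth columns of $X_1$ are the intercept column and a 0/1 indicator of the first group (a hypothetical coding chosen for illustration). The names `is_contrast`, `intercept_col` and `group_col` are likewise hypothetical.

```python
def is_contrast(matrix):
    """Assumed definition: every row of the matrix sums to zero."""
    return all(sum(row) == 0 for row in matrix)


# Within-group block of C_2: adjacent differences that form the four logits.
c2_block = [
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
]

# Within-group block of C_1: the 16 x 16 identity.
c1_block = [[1 if r == c else 0 for c in range(16)] for r in range(16)]

# Assumed columns 1 and 6 of X_1 over the 32 cells (group changes slowest).
intercept_col = [1] * 32
group_col = [1] * 16 + [0] * 16

# The two columns of the direct sum of 1_16 with itself are the group
# indicators; both are linear combinations of the two assumed columns.
first_group = group_col
second_group = [a - b for a, b in zip(intercept_col, group_col)]

print(is_contrast(c2_block), is_contrast(c1_block))
```

Under these assumptions the logit block passes the contrast check, the identity block does not, and the intercept and group columns reproduce both group indicator vectors, which is why only the first and sixth parameters are affected.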

Table 3.12 displays the freedom parameter estimators and their estimated
standard errors, which were calculated under the two sampling
assumptions. Notice that only those standard errors corresponding to the
parameters $\alpha$ and $\alpha_1^{S}$ are different for the two sampling schemes. These are
the parameters that correspond to the first and sixth columns of $X_1$.


Table 3.12. Product-Multinomial versus Product-Poisson
Freedom Parameter Estimation

                                        Product-Multinomial   Product-Poisson
Parameter                    Estimate   Standard Error        Standard Error

$\alpha$                       1.67     0.216                 0.228
$\alpha_1^{V(1)}$             -1.20     0.304                 0.304
$\alpha_1^{V(2)}$             -1.35     0.342                 0.342
$\alpha_1^{V(3)}$             -1.07     0.266                 0.266
$\alpha_1^{V(4)}$             -0.39     0.288                 0.288
$\alpha_1^{S}$                 0.63     0.000                 0.091
$\alpha_{11}^{V(1)S}$          0.00     0.000                 0.000
$\alpha_{11}^{V(2)S}$          0.00     0.000                 0.000
$\alpha_{11}^{V(3)S}$          0.00     0.000                 0.000
$\alpha_{11}^{V(4)S}$          0.00     0.000                 0.000
$\alpha_{11}^{V(1)V(2)}$       0.73     0.323                 0.323
$\alpha_{11}^{V(1)V(3)}$       1.30     0.303                 0.303
$\alpha_{11}^{V(1)V(4)}$       1.64     0.321                 0.321
$\alpha_{11}^{V(2)V(3)}$       1.56     0.304                 0.304
$\alpha_{11}^{V(2)V(4)}$       0.98     0.327                 0.327
$\alpha_{11}^{V(3)V(4)}$       0.92     0.226                 0.226
$\theta$                       2.02     0.134                 0.134
$\theta_1^{V}$                -0.38     0.126                 0.126

One last remark worth mentioning is with regard to the standard error
estimates of the estimated expected cell counts $\{\hat\mu_{ijkls}\}$. The precision
estimates will be different for the two sampling schemes. In fact, the
relationship (2.4.6), viz.

$$\mathrm{var}(\hat\mu^{(M)}) = \mathrm{var}(\hat\mu^{(P)}) - \oplus_{k=1}^{K} \frac{\mu_k \mu_k'}{n_k},$$

allows us to determine how different the two variances will be. For example,
the estimated expected cell count for cell (1,1,1,1,1) is $\hat\mu_{11111} = 232.80$
and the standard errors are 7.029 and 14.292 corresponding to the
product-multinomial and product-Poisson sampling assumptions. The difference in
standard errors is substantial. In contrast, the estimated expected cell count
for cell (1,1,2,2,1) is $\hat\mu_{11221} = 4.09$ and the two standard errors are 1.324 and
1.342. The product-Poisson standard error estimate is only slightly inflated.
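The two pairs of standard errors quoted above can be cross-checked against each other. The sketch below assumes that, for a single cell $c$ in group $k$, relationship (2.4.6) reduces to $\mathrm{var}_M(\hat\mu_c) = \mathrm{var}_P(\hat\mu_c) - \hat\mu_c^2 / n_k$; this diagonal form is an assumption made for illustration, and `implied_group_size` is a hypothetical helper. If the assumption holds, both cells, which belong to the same group, should imply roughly the same group size $n_k$.

```python
def implied_group_size(mu_hat, se_multinomial, se_poisson):
    """Solve var_P - var_M = mu_hat**2 / n for n (assumed diagonal form)."""
    return mu_hat ** 2 / (se_poisson ** 2 - se_multinomial ** 2)


# Cell (1,1,1,1,1): large expected count, large gap between standard errors.
n_from_large_cell = implied_group_size(232.80, 7.029, 14.292)

# Cell (1,1,2,2,1): small expected count, nearly identical standard errors.
n_from_small_cell = implied_group_size(4.09, 1.324, 1.342)

print(round(n_from_large_cell, 1), round(n_from_small_cell, 1))
```

The two implied values agree to within rounding of the reported standard errors, which is consistent with the gap between the sampling schemes growing with the square of the expected count.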
Suppose that, instead of assuming the logit model (3.5.2) for the
marginal parameters, we used the equivalent loglinear model. That is, we
will modify the matrices $C_2$, $A_2$ and $X_2$, and the vector $\beta_2$, so that the logit
model is equivalently expressed as a loglinear model. Let $C_2^* = \oplus_1^2 I_8$, $A_2^* = A_2$
(no modification is necessary for this example), and

With this specification, the logit model is equivalent to the loglinear
model

$$M:\ \log m_i(t; k) = \lambda + \lambda_i^{R} + \lambda_t^{V} + \lambda_{it}^{RV} + \lambda_{tk}^{VS} + \lambda_k^{S}, \tag{3.5.3}$$

where $\lambda^{RV}$ satisfies

\RV  _  \RV  _  \RV  _   \RV.      \RV  _  n 

and $\{m_i(t; k)\}$ is the set of expected marginal counts. That is, $m_i(t; k) =
n_k \phi_i(t; k)$. The vector $\beta_2^*$ is thus defined as

P2  —  ('^5  -^1  )  -^1  5  A2  ,  A3  ,  Aji  ,  A21  ,  A31  ,  Aj     ,  Aj ) 

Notice  that  the  loglinear  model  (3.5.3)  includes  the  VS  effect.  This 
effect  must  be  included  so  that  the  model  is  well  defined.  We  will  discuss  this 
further  in  the  next  section,  section  3.6. 

The matrix $C_2^*$ is not a contrast matrix for the loglinear representation
of the marginal model. Therefore, to determine which freedom parameter
estimators are unaffected by the sampling assumption, we must find, among
the columns of $X_2^*$, the minimal spanning set for $M(\oplus_1^2 1_{m_2^*}) = M(\oplus_1^2 1_8)$.
Notice that the number of response functions, within each population, for
the marginal model is now $m_2^* = q_2^* = 8$, not $q_2 = 4$ as it was for the logit
model. Again, with the parameterization we have chosen, we can easily see
that the first and tenth columns of $X_2^*$ span a set that contains the range space
of $\oplus_1^2 1_8$. Invoking Theorem 2.4.2, we have the following result. Letting the
vector $\beta$ represent the freedom parameter vector for model ((3.5.1) $\cap$ (3.5.3)),

$$\mathrm{var}(\hat\beta^{(M)}) = \mathrm{var}(\hat\beta^{(P)}) - \Lambda = \mathrm{var}(\hat\beta^{(P)}) - \begin{pmatrix} \Lambda^{11} & \Lambda^{12} \\ \Lambda^{21} & \Lambda^{22} \end{pmatrix},$$

where the elements of the partitioned matrix $\Lambda$ are

$$\Lambda^{11}_{kl} \begin{cases} = 0, & \text{if } (k,l) \notin \{1,6\} \times \{1,6\} \\ > 0, & \text{otherwise} \end{cases}$$

$$\Lambda^{12}_{kl} \begin{cases} = 0, & \text{if } (k,l) \notin \{1,6\} \times \{1,10\} \\ \neq 0, & \text{otherwise} \end{cases}$$

$$\Lambda^{21}_{kl} \begin{cases} = 0, & \text{if } (k,l) \notin \{1,10\} \times \{1,6\} \\ \neq 0, & \text{otherwise} \end{cases}$$

and

$$\Lambda^{22}_{kl} \begin{cases} = 0, & \text{if } (k,l) \notin \{1,10\} \times \{1,10\} \\ > 0, & \text{otherwise.} \end{cases}$$

By expressing $\beta = \mathrm{vec}(\beta_1, \beta_2)$ as $\beta = (\theta_1, \theta_2, \ldots, \theta_{26})'$, we can state the
result in another way: If $(i,j) \notin \{1, 6, 17, 26\} \times \{1, 6, 17, 26\}$ then $\mathrm{cov}(\hat\theta_i, \hat\theta_j)$
is the same under both sampling assumptions. If $(i,j)$ is in the set then the
covariances may be different.

To illustrate, we compare the standard errors for the loglinear parameter
estimators. It happens that all of the freedom parameter estimators are the
same (see Theorem 2.4.1) and all of the standard errors are the same except
those associated with the 1st, 6th, 17th, and 26th parameters, namely $\alpha$, $\alpha_1^{S}$, $\lambda$,
and $\lambda_1^{S}$. For these four, the standard error estimates were related as follows

    se($\hat\alpha$ | Poisson) = se($\hat\alpha$ | multinomial) + 0.012
    se($\hat\alpha_1^{S}$ | Poisson) = se($\hat\alpha_1^{S}$ | multinomial) + 0.091
    se($\hat\lambda$ | Poisson) = se($\hat\lambda$ | multinomial) + 0.016
    se($\hat\lambda_1^{S}$ | Poisson) = se($\hat\lambda_1^{S}$ | multinomial) + 0.091.

In summary, we were able to easily determine when inferences using
certain freedom parameter estimators would be the same under both sampling
schemes. This holds for a very broad class of generalized loglinear models of
the form $C \log A\mu = X\beta$. Basically, if the matrix $C$ is a contrast matrix,
that is, both $C_1$ and $C_2$ are contrast matrices, all of the inferences are the
same. On the other hand, if, for example, $C_1$ of $C$ is an identity matrix
then we must look at the design matrix $X_1$ to determine which columns form
a minimal spanning subset for the range space of some matrix of the form
$\oplus_1^K 1_{m_i}$. When $C_i$ is an identity matrix, $m_i = q_i$ is the number of response
functions, within each population (or level of covariate), that are modeled via
$C_i \log A_i\mu = X_i\beta_i$.

3.6     Well-Defined Models and the Computation of
Residual Degrees of Freedom

We made some remarks above with regard to models being well or ill
defined. To illustrate, we use the simple example in which the joint distribution
model is $J(SY)$ and the marginal distribution model is $M(MH)$. We stated
that the model $J(SY) \cap M(MH)$ is ill defined since the constraints implied
by the symmetry model $J(SY)$, namely that the marginal distributions are
equal, are the model constraints of $M(MH)$. We will show that, for the
one population setting, as long as the main-effects loglinear parameters are
allowed to be arbitrary (up to freedom parameter identifiability constraints)
the joint distribution model will only imply that the expected marginal counts
satisfy the (multinomial) identifiability constraints. In all other respects the
expected marginal counts are allowed to be arbitrary positive numbers. That
is, the joint distribution model and the marginal distribution model will
not include redundant constraints and the simultaneous model will be well
defined. For this example, $J(SY)$ restricts the main-effects parameters to
satisfy

$$\alpha_i^{A} = \alpha_i^{B}, \quad i = 1, \ldots, I.$$

Evidently, the sufficient condition for the model to be generally well defined
is not met. We also discuss sufficient conditions for a simultaneous model to
be well defined when there are covariates present.

A  simultaneous  model  will  necessarily  be  well  defined  if  the  following 
three  conditions  hold:  The  joint  distribution  model  must  be  well  defined.  The 
marginal  distribution  model  must  be  well  defined.  And,  the  joint  distribution 
model  must  only  constrain  the  expected  marginal  counts  to  satisfy  the 
identifiability  constraints.  The  first  two  conditions  hold  whenever  the  models 
do  not  contain  redundant  and/or  conflicting  constraints;  the  identifiability 
constraints  being  included.  For  example  if  one  covariate  is  present,  as  long 
as  the  generalized  loglinear  portion  of  the  model  allows  for  a  perfect  fit  to 
the  sums  of  expected  counts  within  each  level  of  the  covariate,  the  model  will 
be  well  defined.  In  what  follows  we  consider  the  two  response,  one  covariate 
case  to  illustrate  how  one  can  identify  a  large  class  of  simultaneous  models 
that  will  be  well  defined.  The  extension  to  arbitrary  numbers  of  responses 
and  covariates  is  straightforward. 

Suppose that $A$ and $B$ are two response variables. We will initially
allow the number of response categories for $A$ and $B$, namely $I$ and $J$, to
be different. Since this chapter deals with situations when the responses
are measured on the same scale (i.e. $I = J$), we will also address the
sufficient conditions for model well definedness in that case. Denote the $K$
level covariate by $P$. The following lemma identifies a large class of joint
distribution models that only imply that the expected marginal counts satisfy
the identifiability constraints. It is important to point out that we will be
referring to two types of identifiability constraints. 'Identifiability' constraints
are those constraints associated with multinomial sampling, namely that
certain sums of probabilities add up to 1. 'Freedom identifiability' constraints are
those constraints that are necessary to ensure that each freedom parameter in
the model is estimable. The identifiability constraints for $\mu$ will generically be
labelled as $ident(\mu)$ in this section. Similarly, let the identifiability constraints
for $m$, the vector of expected marginal counts, be denoted by $ident(m)$. These
constraints are implied by $ident(\mu)$.

Lemma 3.6.1. Let the hierarchical loglinear model $(AP, BP)$ be specified as
either

$$\log \mu = X^*\beta^*,\ ident(\mu), \qquad \text{or} \qquad U^{*\prime} \log \mu = 0,\ ident(\mu).$$

Suppose that the joint distribution model $[\Theta_J]$ can be specified as either

$$\log \mu = X\beta,\ ident(\mu), \qquad \text{or} \qquad U' \log \mu = 0,\ ident(\mu).$$

If $[\Theta_J]$ is no more restrictive than $(AP, BP)$ in the sense that

$$M(X) \supseteq M(X^*) \qquad \text{or} \qquad M(U) \subseteq M(U^*),$$

then $[\Theta_J]$ only constrains the expected marginal counts to satisfy the
identifiability constraints $ident(m)$.

Proof: Write the model $(AP, BP)$ as

$$\log \mu_{ijk} = \alpha + \alpha_i^{A} + \alpha_j^{B} + \alpha_k^{P} + \alpha_{ik}^{AP} + \alpha_{jk}^{BP},$$

where without loss of generality the freedom identifiability constraints are

$$\alpha_1^{A} = \alpha_1^{B} = \alpha_1^{P} = \alpha_{1k}^{AP} = \alpha_{i1}^{AP} = \alpha_{1k}^{BP} = \alpha_{j1}^{BP} = 0, \quad \forall\, i, j, k,$$

and the identifiability constraints $ident(\mu)$ are

$$\sum_i \sum_j \mu_{ijk} = n_k, \quad k = 1, \ldots, K.$$

Using the identifiability constraints we can write

$$n_k = \exp(\alpha + \alpha_k^{P})\, \gamma_k^{A} \gamma_k^{B}, \quad k = 1, \ldots, K,$$

where

$$\gamma_k^{A} = \sum_i \exp(\alpha_i^{A} + \alpha_{ik}^{AP}), \qquad \gamma_k^{B} = \sum_j \exp(\alpha_j^{B} + \alpha_{jk}^{BP}).$$

Hence,

$$\alpha + \alpha_k^{P} = \log n_k - \log \gamma_k^{A} - \log \gamma_k^{B}.$$

Now all of the freedom parameters not constrained by the freedom
identifiability constraints or the identifiability constraints are completely arbitrary.
It follows that $\{\gamma_k^{A}\}$ and $\{\gamma_k^{B}\}$, which are functions of these arbitrary freedom
parameters, are also completely arbitrary.

Therefore,

$$m_i(1,k) = \sum_j \mu_{ijk}
= \exp(\log n_k - \log \gamma_k^{A} - \log \gamma_k^{B} + \alpha_i^{A} + \alpha_{ik}^{AP})\, \gamma_k^{B}
= \exp(\log n_k - \log \gamma_k^{A} + \alpha_i^{A} + \alpha_{ik}^{AP})
= \frac{n_k \exp(\alpha_i^{A} + \alpha_{ik}^{AP})}{\gamma_k^{A}}.$$

That is, this set of expected marginal counts follows a saturated multinomial
loglinear model. Similarly,

$$m_j(2,k) = \frac{n_k \exp(\alpha_j^{B} + \alpha_{jk}^{BP})}{\gamma_k^{B}}$$

follow a saturated multinomial loglinear model. Since the two sets of expected
marginal counts are functions of different arbitrary parameters we have that
the entire set of expected marginal counts are constrained only to satisfy the
identifiability constraints $ident(m)$, viz.

$$\sum_{i=1}^{I} m_i(1,k) = n_k, \quad \text{and} \quad \sum_{j=1}^{J} m_j(2,k) = n_k, \quad k = 1, \ldots, K.$$

Now, if any joint distribution model is less restrictive in the sense stated in the
lemma, it must be that the model only constrains the expected marginal
counts to satisfy $ident(m)$. This is what we set out to show. □
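The closed forms in the proof can be checked numerically. The following is a minimal sketch, not part of the original argument: it draws arbitrary values for the combined linear predictors $\alpha_i^A + \alpha_{ik}^{AP}$ and $\alpha_j^B + \alpha_{jk}^{BP}$, fixes $\alpha + \alpha_k^P$ through the identifiability constraint, and compares the summed cell means with $n_k \exp(\cdot)/\gamma_k$. The function name `check_margins`, the table dimensions and the group sizes are all hypothetical choices made for illustration.

```python
import math
import random


def check_margins(I=3, J=4, K=2, n=(100.0, 60.0), seed=0):
    """Return the largest gap between directly summed expected counts
    and the closed-form margins derived in the proof."""
    rng = random.Random(seed)
    # a[i][k] stands for alpha_i^A + alpha_ik^AP; b[j][k] likewise for B.
    a = [[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(I)]
    b = [[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(J)]
    worst = 0.0
    for k in range(K):
        gamma_a = sum(math.exp(a[i][k]) for i in range(I))
        gamma_b = sum(math.exp(b[j][k]) for j in range(J))
        # alpha + alpha_k^P is determined by the constraint sum(mu) = n_k.
        base = math.log(n[k]) - math.log(gamma_a) - math.log(gamma_b)
        mu = [[math.exp(base + a[i][k] + b[j][k]) for j in range(J)]
              for i in range(I)]
        for i in range(I):
            closed = n[k] * math.exp(a[i][k]) / gamma_a
            worst = max(worst, abs(sum(mu[i]) - closed))
        for j in range(J):
            closed = n[k] * math.exp(b[j][k]) / gamma_b
            column_sum = sum(mu[i][j] for i in range(I))
            worst = max(worst, abs(column_sum - closed))
        total = sum(sum(row) for row in mu)
        worst = max(worst, abs(total - n[k]))
    return worst


print(check_margins())
```

Because the draws for the two responses are independent of each other, the two sets of margins vary freely apart from summing to $n_k$, which is the content of the lemma.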

As a special case, suppose that the covariate $P$ has just one level,
i.e. $K = 1$. Lemma 3.6.1 tells us that a sufficient condition for the joint
distribution model to only constrain the expected marginal counts to satisfy
$ident(m)$ is that the main-effects parameters $\{\alpha_i^{A}\}$ and $\{\alpha_j^{B}\}$ be arbitrary up
to the freedom identifiability constraints. In fact, for the case $I = J$, in view
of the proof of the lemma, if we constrained the main-effects parameters to
satisfy

$$\alpha_i^{A} = \alpha_i^{B}, \quad i = 1, \ldots, I,$$

then expected marginal counts would be constrained to satisfy marginal
homogeneity. Another generalization of Lemma 3.6.1 involves the situation
when there is more than one covariate. If there was more than one covariate,
say $P$ and $Q$, then the joint distribution model should be no more restrictive
than the hierarchical loglinear model $(APQ, BPQ)$ for the conclusion of
Lemma 3.6.1 to hold.

Since most reasonable joint distribution models will be well defined we
assume this to be the case and hence are left to show that the marginal
distribution model is well defined. To show this, we simply must show
that the generalized loglinear or linear marginal model constraints and
the identifiability constraints $ident(m)$ (which are implied by $ident(\mu)$) are
independent. We will initially assume that $I$ need not equal $J$. Let the factors
$R_1$ and $R_2$ represent the level of response to factors $A$ and $B$. That is, $R_1$
is an $I$ level factor and $R_2$ is a $J$ level factor. A simple loglinear model for
the expected marginal counts can be written as $((R_1, P), (R_2, P))$. What this
means is that the expected marginal counts satisfy

$$\log m_i(1,k) = \beta_1 + \beta_i^{R_1} + \beta_{1k}^{P}, \quad i = 1, \ldots, I,\ k = 1, \ldots, K$$
$$\log m_j(2,k) = \beta_2 + \beta_j^{R_2} + \beta_{2k}^{P}, \quad j = 1, \ldots, J,\ k = 1, \ldots, K \tag{3.6.1}$$
$$\beta_1^{R_1} = \beta_1^{R_2} = \beta_{11}^{P} = \beta_{21}^{P} = 0, \quad ident(m).$$

Suppose now that $I = J$. As before, let the factor $R$ represent the
common levels of response for both response factors $A$ and $B$. Also, the
factor $V$ will again be defined to be the response variable factor. For this
example, $V$ is a two-level factor taking on the values 1, corresponding to
the 'first' response $A$, and 2, corresponding to the 'second' response $B$. For
longitudinal data, $V$ is referred to as the 'Occasion' variable. Since $I = J$ we
can consider an even simpler model. We could assume that

and consider the model $(R, VP)$, which can be specified as

$$\log m_i(t,k) = \tau + \tau_i^{R} + \tau_t^{V} + \tau_k^{P} + \tau_{tk}^{VP}, \quad t = 1, 2,\ i = 1, \ldots, I,\ k = 1, \ldots, K, \tag{3.6.2}$$

where

$$\tau_k^{P} + \tau_{tk}^{VP} = \beta_{tk}^{P}, \quad t = 1, 2,\ k = 1, \ldots, K,$$

the $\tau$ parameters satisfy the freedom constraints,

$$\tau_1^{V} = \tau_1^{P} = \tau_{1k}^{VP} = \tau_{t1}^{VP} = 0, \quad \forall\, t, k,$$

and the identifiability constraints $ident(m)$ are satisfied. Notice that the
model $(R, VP)$ only makes sense when $I = J$; it implies marginal homogeneity
of the $A$ and $B$ response distributions. The following lemma provides us with
a way of identifying a large class of marginal distribution models that are well
defined. It is concerned with the case when $I$ need not equal $J$. Lemma 3.6.3
applies when $I = J$. Each of these lemmas is easily generalizable to situations
when there are many response variables and many covariates.

Lemma 3.6.2 Suppose that the marginal distribution model $((R_1, P), (R_2, P))$
can be written as either

$$\log m = X^*\beta^*,\ ident(m) \qquad \text{or} \qquad U^{*\prime} \log m = 0,\ ident(m),$$

where $ident(m)$ are those identifiability constraints implied by $ident(\mu)$.
Specify the marginal distribution model $[\Theta_M]$ as

$$\log m = X\beta,\ ident(m) \qquad \text{or} \qquad U' \log m = 0,\ ident(m).$$

If $[\Theta_M]$ is no more restrictive than $((R_1, P), (R_2, P))$ in the sense that

$$M(X) \supseteq M(X^*) \qquad \text{or} \qquad M(U) \subseteq M(U^*)$$

then $[\Theta_M]$ is well defined.

Proof: By equation (3.6.1), the marginal model $((R_1, P), (R_2, P))$, without
the identifiability constraints, implies that

$$s(1,k) \equiv \sum_{i=1}^{I} m_i(1,k) = \exp(\beta_1 + \beta_{1k}^{P}) \sum_{i=1}^{I} \exp(\beta_i^{R_1}) \quad \text{and}$$

$$s(2,k) \equiv \sum_{j=1}^{J} m_j(2,k) = \exp(\beta_2 + \beta_{2k}^{P}) \sum_{j=1}^{J} \exp(\beta_j^{R_2}).$$

Hence, the $s(t,k)$, which are functions of $2K$ arbitrary parameters, are
arbitrary. Since the identifiability constraints $ident(m)$ constrain the $s(t,k)$ to
satisfy $s(t,k) = n_k$, $k = 1, \ldots, K$, $t = 1, 2$ and the model constraints allow the
$s(t,k)$ to be completely arbitrary, it follows that the model $((R_1, P), (R_2, P))$
is well defined. Also, any less restrictive marginal distribution model will also
be well defined. □

Notice that in the proof of Lemma 3.6.2 the conclusion would still hold
if the sums $\sum_{i=1}^{I} \exp(\beta_i^{R_1})$ and $\sum_{j=1}^{J} \exp(\beta_j^{R_2})$ were constrained to equal each
other. This will be important when we show that the model $(R, VP)$ is well
defined.

Suppose now that $I = J$ so that the model $(R, VP)$ is reasonable. This
next lemma identifies a large class of marginal distribution models that are
well defined when the responses are measured on the same scale.

Lemma 3.6.3 Suppose that the model $(R, VP)$ can be written as either

$$\log m = X^*\beta^*,\ ident(m) \qquad \text{or} \qquad U^{*\prime} \log m = 0,\ ident(m).$$

Specify the marginal distribution model $[\Theta_M]$ as

$$\log m = X\beta,\ ident(m) \qquad \text{or} \qquad U' \log m = 0,\ ident(m).$$

If $[\Theta_M]$ is no more restrictive than $(R, VP)$ in the sense that

$$M(X) \supseteq M(X^*) \qquad \text{or} \qquad M(U) \subseteq M(U^*)$$

then it is well defined.

Proof: By equation (3.6.2), we can write the sums $s(t,k) = \sum_i m_i(t,k)$ as

$$s(t,k) = \exp(\tau + \tau_t^{V} + \tau_k^{P} + \tau_{tk}^{VP}) \sum_i \exp(\tau_i^{R}).$$

Notice that the first exponential term is completely arbitrary; it is a function
of $2K$ independent parameters. Therefore the set of sums $\{s(t,k)\}$ is not
constrained in any way by the model constraints, $\log m = X^*\beta^*$. As in the
proof of Lemma 3.6.2, it follows that the marginal distribution model $(R, VP)$
is well defined. Finally, any less restrictive model will also be well defined. □
In view of the proof of Lemma 3.6.3, the model $(R, V, P)$ would not be
well defined; neither would $(RV, P)$. In order for the marginal distribution
model to be well defined the loglinear model must include the $VP$ effect. We
can easily generalize the results of Lemma 3.6.3. Suppose that there are two
covariates, say $P$ and $Q$. It can be shown that any marginal distribution
model that is no more restrictive than the loglinear model $(R, VPQ)$ is well
defined. A marginal distribution model that is specified as a cumulative- or
adjacent categories-logit model would be well defined if the model allows the
sums $\{s(t,k)\}$ to be completely arbitrary.

We now state an important theorem that addresses the issue of model
well definedness. The theorem is specifically for the case when the response
variables $A$ and $B$ are measured on the same scale and there is just one
covariate $P$. It can easily be generalized to the case of several distinct
responses and several covariates.

Theorem 3.6.1 Suppose that the joint distribution model $[\Theta_J]$ is no
more restrictive than the loglinear model $(AP, BP)$ and that the marginal
distribution model $[\Theta_M]$ is no more restrictive than the loglinear model
$(R, VP)$. It follows that the simultaneous model $[\Theta_J \cap \Theta_M]$ is well defined.

Proof: The proof follows immediately by Lemmas 3.6.1 and 3.6.3 and
the fact that a simultaneous model is well defined if the following conditions
hold: Both the joint and marginal distribution models are well defined and
the joint distribution model only constrains the expected marginal counts to
satisfy the identifiability constraints $ident(m)$. □

A few remarks about Theorem 3.6.1 are in order. Firstly, when there is
only one population of interest the sufficient condition is that the main-effects
parameters are allowed to be arbitrary. It follows that such models as quasi
symmetry ($J(QSY)$) satisfy these sufficient conditions. Also, models such as
$J(UA(G))$ and $J(UA)$ of the cross-over example satisfy the conditions. This
follows since the model $J(UA)$ is equivalent to the model $J(L \times L)$ which
satisfies the sufficient conditions of the theorem; it is less restrictive than
$(V^{(1)}G, V^{(2)}G)$.

For the example of section 3.5, we see that had we left the effect $VS$
out of the marginal loglinear model (3.5.3), the marginal model would have
constrained the sums $\{s(t,k)\}$ to lie in some restricted space. This can be
seen by noting that

$$s(t,k) = \sum_i \exp(\beta + \beta_t^{V} + \lambda_i^{R} + \beta_k^{S} + \lambda_{it}^{RV})
= \exp(\beta + \beta_t^{V} + \beta_k^{S}) \sum_i \exp(\lambda_i^{R} + \lambda_{it}^{RV})$$

and that neither $\exp(\beta + \beta_t^{V} + \beta_k^{S})$ nor $\sum_i \exp(\lambda_i^{R} + \lambda_{it}^{RV})$ is completely arbitrary;
$s(t,k)$ is constrained to satisfy $s(t,k) = \pi_t \rho_k$ for some $\pi_t$ and $\rho_k$. Therefore,
the marginal model constraints and the identifiability constraints are not
independent. That is, model ((3.5.1) $\cap$ (3.5.3)) would not be well defined if
the effect $VS$ were not included in (3.5.3). This also follows directly from
Theorem 3.6.1. Using the program 'mle.restraint', an attempt was made to
fit the ill-defined model. The algorithm did not converge. In practice, this
nonconvergence could very well indicate that the model is ill defined (see
section 2.5) as it did in this example.
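The product structure $s(t,k) = \pi_t \rho_k$ that arises when the $VS$ effect is dropped can be seen numerically. The sketch below is illustrative only: the function name `marginal_sums_without_vs` is hypothetical, the parameter values are random draws rather than estimates, and the parameter names simply mirror the terms in the display above. If the sums factor as claimed, the ratio $s(t,1)/s(t,2)$ must be the same at every occasion $t$.

```python
import math
import random


def marginal_sums_without_vs(T=4, K=2, I=2, seed=1):
    """Compute s(t, k) = sum_i exp(linear predictor) for a marginal
    loglinear model that has R, V, S and RV terms but no VS term."""
    rng = random.Random(seed)

    def draw():
        return rng.uniform(-1.0, 1.0)

    beta = draw()
    beta_v = [draw() for _ in range(T)]
    beta_s = [draw() for _ in range(K)]
    lam_r = [draw() for _ in range(I)]
    lam_rv = [[draw() for _ in range(T)] for _ in range(I)]
    return [
        [
            sum(
                math.exp(beta + beta_v[t] + beta_s[k] + lam_r[i] + lam_rv[i][t])
                for i in range(I)
            )
            for k in range(K)
        ]
        for t in range(T)
    ]


s = marginal_sums_without_vs()
# Under the product form pi_t * rho_k, this ratio cannot depend on t.
group_ratios = [row[0] / row[1] for row in s]
spread = max(group_ratios) - min(group_ratios)
print(spread)
```

The sums are therefore confined to a restricted (rank-one) set rather than being free to take arbitrary positive values, which is the sense in which the model constraints and $ident(m)$ fail to be independent.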

If a simultaneous model is well defined it follows that the residual degrees
of freedom can be computed as

$$df_{res}[\Theta_J \cap \Theta_M] = df_{res}[\Theta_J] + df_{res}[\Theta_M], \tag{3.6.3}$$

since the model constraints are nonredundant. For example, the residual
degrees of freedom for measuring goodness of fit of the simultaneous model
$J(L \times L + D) \cap M(U)$ used in the political interest data example can be
computed in this way. This follows since the model $J(L \times L + D)$ satisfies
the sufficient conditions of Theorem 3.6.1 and so, if $M(U)$ is well defined, the
simultaneous model $J(L \times L + D) \cap M(U)$ is well defined. In contrast, the
model $J(V^{(1)}, V^{(2)}, G)$ used for the cross-over data example does not satisfy
the conditions of the theorem since the effects $V^{(1)}G$ and $V^{(2)}G$ are omitted.
In fact, the model implies that there is no Group ($G$) by Response level
($R$) association. Therefore, the simultaneous model comprised of this joint
distribution model along with the marginal cumulative-logit model $M(V)$ is
ill defined since $M(V)$ implies the same constraints. Equation (3.6.3) does
not apply in this case.
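Equation (3.6.3) can be tried on the respiratory illness model of section 3.5, whose reported fit has $df = 22$. The arithmetic below is a sketch under stated assumptions: it counts residual degrees of freedom for each component as the number of modeled response functions minus the number of freedom parameters, taking 32 cell log-means and 16 parameters for the joint model and 8 marginal logits and 2 parameters for the marginal model. The helper `residual_df` is hypothetical.

```python
def residual_df(n_response_functions, n_freedom_parameters):
    """Residual df of one component model (assumed counting rule)."""
    return n_response_functions - n_freedom_parameters


# Joint model (3.5.1): 2 groups x 16 cells, 16 parameters in beta_1.
df_joint = residual_df(2 * 16, 16)

# Marginal model (3.5.2): 2 groups x 4 logits, 2 parameters in beta_2.
df_marginal = residual_df(2 * 4, 2)

# Equation (3.6.3): the component df add when the model is well defined.
df_simultaneous = df_joint + df_marginal
print(df_joint, df_marginal, df_simultaneous)
```

The total agrees with the 22 degrees of freedom quoted for the fitted simultaneous model, as expected when the two sets of constraints are nonredundant.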

3.7     Discussion 

In this chapter, we introduced a broad class of models that imply
structure on both the joint and marginal distributions of multivariate categorical
response vectors when the response scale was the same for each response. We
showed that these models can be fit using the ML fitting method of Chapter
2. Several numerical examples were considered, illustrating the usefulness
of simultaneously modeling the joint and marginal distributions. All of the
models were fitted using the FORTRAN program 'mle.restraint', which was
developed by the author.

Model parsimony was the impetus behind this entire chapter. Our
objective was to find parsimonious models that both fit the data well and
provided us with straightforward interpretations of freedom parameters. The
models often included parameters that measured departures from
independence among the responses, as well as parameters that measured departure
from marginal homogeneity. It was shown, via a numerical example, that
parsimonious modeling may result in more efficient and reliable estimation
of both model and freedom parameters. The researcher must find a balance
between a model that is too structured and one that is not structured enough.
The author fully intends to conduct simulation studies to better understand
the importance of parsimonious modeling in this setting.

Although  we  provide  somewhat  general  results  regarding  compatibility 
of  the  joint  and  marginal  models,  there  still  is  a  need  for  more  general  results. 
We  discuss  the  case  when  the  joint  and  marginal  models  can  be  expressed, 
at  least  equivalently,  as  loglinear  models.  More  general  results  are  needed  for 
other  types  of  models,  such  as  cumulative-logit  and  linear  models.  For  these 
simultaneous  models  to  be  useful  to  the  practitioner,  a  general  method  to 
determine  whether  the  constraints  implied  by  the  two  models  are  independent 
must  be  developed.  The  proposition  in  section  3.6  is  a  step  in  the  right 
direction. 

A  factor  that  could  impede  the  use  of  this  method  to  fit  models  to  very 
large  data  sets  is  the  input  requirements.  The  algorithm  requires  a  substantial 
amount  of  input.  For  example,  consider  the  input  required  for  the  example 
in section 3.5. The matrices C, A, and X all must be input. Although the
required input is simple to determine, there is much energy expended inputting
the  information.  An  input  program  must  be  developed  and  implemented  in 
the  program  'mle.restraint'. 

The  assessment  of  model  goodness  of  fit  is  straightforward  when  using 
the ML method. The (log) likelihood-ratio statistic $G^2$, the Pearson statistic
$X^2$, or the Wald statistic $W^2$ can be used for this purpose. Of interest to
the practicing statistician is the ability to assess how far wrong one can
be by assuming that the responses are independent. The test statistic used
for this purpose is simply the likelihood-ratio statistic that measures how
'far apart' the models $J(I) \cap M(U)$ and $J(S) \cap M(U)$ are. Because the
model $J(I) \cap M(U)$ is nested within the model $J(S) \cap M(U)$, one can use,
as a measure of this distance, the difference between the two likelihood-ratio
statistics, viz. $G^2[J(I) \cap M(U)] - G^2[J(S) \cap M(U)]$. More generally, there are
many  assumptions  one  can  make  about  the  association  structure  among  the 
responses.  With  the  methods  of  this  dissertation,  one  can  easily  derive  tests 
for  the  validity  of  the  assumptions. 
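
The nested comparison described above can be sketched numerically. The following is a minimal illustration and not the author's 'mle.restraint' program: the function names and the 2 x 2 counts are hypothetical, and the formula $G^2 = 2\sum \text{obs}\,\log(\text{obs}/\text{fit})$ assumes fitted values that reproduce the observed total.

```python
import math

def g2(observed, fitted):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * log(obs / fit)).

    Cells with a zero observed count contribute nothing to the sum.
    """
    return 2.0 * sum(o * math.log(o / f)
                     for o, f in zip(observed, fitted) if o > 0)

def g2_difference(observed, fitted_restricted, fitted_general):
    """G^2[restricted] - G^2[general] for two nested models."""
    return g2(observed, fitted_restricted) - g2(observed, fitted_general)

# Hypothetical 2 x 2 table in row-major order (not data from the text).
observed = [30.0, 10.0, 10.0, 30.0]
fit_independence = [20.0, 20.0, 20.0, 20.0]   # restricted (nested) model
fit_saturated = list(observed)                # more general model
distance = g2_difference(observed, fit_independence, fit_saturated)
```

The difference would be referred to a chi-squared distribution whose degrees of freedom equal the difference in the number of constraints imposed by the two models; that step is omitted here to keep the sketch free of extra dependencies.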

As  an  alternative  to  longitudinal  type  sampling  designs,  a  cross-sectional 
sample may be taken. Cross-sectional sampling involves sampling independent
groups of subjects for each response. The research questions posed about
the marginal distributions are such that they could be answered using
cross-sectional data. In this sense, the marginal models are 'population averaged'
models  (Zeger  et  al.,  1988).  However,  a  cross-sectional  sampling  design 
results  in  more  subject  variability,  since  nonhomogeneous  subjects  are  used  for 
each  response,  and  the  detection  of  differences  in  the  marginal  distributions 
may be clouded by these subject effects (Laird, 1991). Further, with
cross-sectional studies, we are unable to explore the association structure among
the  responses.  This  information,  regarding  the  association  structure,  may  be 
of  substantive  importance  in  some  situations. 


CHAPTER  4 
LOGLINEAR  MODEL  FITTING  WITH  INCOMPLETE  DATA 


4.1     Introduction 

We  consider  making  inferences  about  loglinear  model  parameters  when 
only  disjoint  sums  of  the  complete  data  are  observed.  Inferences  will  be  made 
based  on  the  maximum  likelihood  estimates  of  the  model  parameters  and  an 
estimate  of  precision  of  these  estimates.  As  an  example,  consider  the  data  in 
Table  1  of  Goodman  (1974).  Each  of  216  respondents  was  classified  as  being 
universalistic  or  particularistic  when  confronted  by  each  of  four  situations 
(A, B, C, D) of role conflict. Goodman (1974) postulated the presence of an
underlying  two-level  latent  factor  W  which  was  not  observed.  Within  a  level 
of the latent factor the manifest variables (A, B, C, D) are assumed to be
mutually  independent.  Thus,  the  latent  class  structure  would  allow  us  to 
simply  explain  the  relationship  among  the  four  manifest  variables.  In  this 
setting  the  unobservable  complete  data  are  the  counts  resulting  from  a  cross- 
classification  on  the  four  manifest  factors  and  the  latent  factor.  The  data, 
if observable, could be displayed in a $2^5$ contingency table. The observable
incomplete  data  are  the  counts  obtained  by  summing  over  the  two  levels  of 
the  latent  factor,  i.e.  the  incomplete  data  are  disjoint  sums  of  the  complete 
data.  As  in  Goodman  (1974),  we  assume  the  complete  data  means  follow  a 
loglinear  model  which  implies  conditional  independence  among  the  manifest 
factors (A, B, C, D) given the latent factor W. Our objectives include finding
the maximum likelihood estimates of the loglinear parameters based on
the  observed  data,  estimating  their  precision,  computing  other  model  based 
estimators  and  their  standard  errors,  and  testing  model  goodness  of  fit. 

There  are  many  ways  to  find  the  maximum  likelihood  estimators,  each 
method  having  its  positive  and  negative  features.  For  example,  we  could  work 
directly  with  the  incomplete-data  likelihood,  which  is  usually  complicated 
relative to the complete-data likelihood, and use a Newton-Raphson or
Fisher-scoring algorithm. Palmgren and Ekholm (1987) and Haberman (1989) use
these  methods  to  obtain  maximum  likelihood  estimates  and  their  standard 
errors.  We  could  avoid  the  complicated  likelihood  altogether  and  use  the 
Expectation-Maximization  algorithm  (Dempster  et  al.,  1977).  Sundberg 
(1976)  discusses  the  properties  of  the  EM  algorithm  when  it  is  used  to  fit 
models  to  data  coming  from  the  regular  exponential  family.  In  section  4.2 
the  EM  algorithm  is  explored  in  greater  detail. 

Unlike  the  other  approaches,  the  EM  algorithm  is  insensitive  to  starting 
values.  This  is  important  in  practice  since  we  seldom  have  any  idea  what 
a  reasonable  starting  value  is.  Another  positive  feature,  not  shared  by  the 
other  methods,  is  that  the  convergence  to  the  maximum  is  monotonic,  i.e. 
the  likelihood  is  increased  at  each  successive  iteration.  Drawbacks  to  the  EM 
algorithm  are  that  (1)  it  is  relatively  slow  and  (2)  an  estimate  of  precision 
of  the  parameter  estimate  is  not  obtained  as  a  by-product  of  the  algorithm. 
N-R  and  Fisher-scoring,  on  the  other  hand,  are  faster  and,  as  a  by-product, 
provide  us  with  an  estimate  of  precision.  The  slow  convergence  of  the  EM 
algorithm  can  be  mitigated  somewhat  using  the  acceleration  methods  of 
Meilijson (1989) or Louis (1982). Also, increased computer efficiency has
made the slow convergence less of an issue. In section 4.3.2 we address
the second drawback of the EM algorithm by deriving an explicit form for
the observed information matrix when the complete data are independent
Poissons  with  means  following  a  loglinear  model.  The  observed  information 
matrix  is  computed  upon  convergence  of  the  EM  algorithm  and  then  inverted. 
The  inverse  will  serve  as  the  estimate  of  precision.  In  section  4.5  we  explore 
an  iterative  scheme  that  uses  both  NR  and  EM,  exploiting  each  of  their  strong 
points. 

4.2     Review  of  the  EM  Algorithm 

The EM algorithm is generally used in those estimation problems in
which the likelihood is complicated, rendering it difficult or impractical to
maximize, but in which the data can be viewed as being some function
of complete data for which, had they been observed, evaluation of maximum
likelihood estimates would be simple. Unlike many other statistical
root-finding algorithms, the EM algorithm does not require explicit calculation of
the score vector or its derivative. It uses much simpler functions.

The  EM  algorithm  is  by  no  means  a  new  method  for  finding  maximum 
likelihood  estimates.  Goodman  (1974)  essentially  used  it.  Sundberg  (1976) 
discusses  it  at  length  when  used  in  the  exponential  family  case.  Dempster, 
Laird,  and  Rubin  (1977)  provide  us  with  a  review  of  the  method  as  well  as 
some  of  its  properties.  Subsequent  work  with  the  EM-algorithm  has  been 
primarily  devoted  to  improving  the  speed  of  its  convergence  (Louis,  1982; 
Meilijson, 1989).

4.2.1     General Results

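Suppose the complete data X has density $f_X(x;\theta)$ with respect to some
measure. Let Y = Y(X), a function of the complete data, denote the observed
data. It follows that the density of Y is

\[ f_Y(y;\theta) = \int_R f_X(x;\theta)\, d\nu(x), \tag{4.2.1} \]

where $R = \{x : Y(x) = y\}$ and $\nu$ is some appropriate measure. Since Y is a
function of X, the joint density of X and Y can be written as

\[ f_{X,Y}(x,y;\theta) = f_X(x;\theta)\, I_R(x), \]

where $I_R(x)$ indicates membership in R. Hence, the conditional density of X
given Y = y is

\[ f_{X|Y}(x;y,\theta) = \frac{f_X(x;\theta)\, I_R(x)}{f_Y(y;\theta)}. \tag{4.2.2} \]

Therefore, for any x in R, the log likelihood based on Y is

\[ \ell_Y(\theta;y) = \log f_Y(y;\theta) = \log f_X(x;\theta) - \log f_{X|Y}(x;y,\theta). \]

Taking the conditional expectation (given Y = y) at $\theta_0$ gives us

\[ \begin{aligned}
\ell_Y(\theta;y) &= E(\ell_Y(\theta;y) \mid Y = y, \theta_0) \\
&= E(\ell_X(\theta;X) \mid Y = y, \theta_0) - E(\ell_{X|Y}(\theta;y,X) \mid Y = y, \theta_0) \\
&\overset{\mathrm{def}}{=} Q(\theta,\theta_0,y) - H(\theta,\theta_0,y).
\end{aligned} \]

The EM algorithm is defined by

\[ Q(\theta^{(m+1)},\theta^{(m)},y) = \max_{\theta}\, Q(\theta,\theta^{(m)},y), \tag{4.2.3} \]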

i.e. given the $m$th iterate estimate of $\theta$, $\theta^{(m)}$, the next iterate is that value of
$\theta$ that maximizes $Q(\theta,\theta^{(m)},y)$.

The following properties of the EM algorithm are verified in the
appendix. The proofs follow from Dempster et al. (1977) and Louis (1982).
In what follows S denotes a score vector and I an information matrix.

Property 1:

If $\theta^{(m)}$ and $\theta^{(m+1)}$ are the $m$th and $(m+1)$st iterate estimates obtained via
the EM algorithm then

\[ \ell_Y(\theta^{(m+1)};y) \ge \ell_Y(\theta^{(m)};y), \]

i.e. the log likelihood is increased at each successive iteration.

Property 2:

The sequence of EM iterates $\{\theta^{(m)}, m \ge 1\}$ satisfies, whenever $\theta^{(m)}$
converges to $\theta^{(\infty)}$ as $m \to \infty$,

\[ \frac{\partial}{\partial\theta}\,\ell_Y(\theta;y)\Big|_{\theta^{(\infty)}} = S_Y(\theta^{(\infty)};y) = 0, \]

i.e. the estimates converge to a zero of the score vector for Y.

Property 3:

For any $\theta_0$,

\[ \frac{\partial}{\partial\theta}\big[Q(\theta,\theta_0,y)\big]\Big|_{\theta_0} = S_Y(\theta_0;y) = E(S_X(\theta_0;X) \mid Y = y, \theta_0). \]

Property 4:

For any $\theta_0$,

\[ I_Y(\theta_0;y) = E(I_X(\theta_0;X) \mid Y = y, \theta_0) - \mathrm{var}(S_X(\theta_0;X) \mid Y = y, \theta_0). \]

Briefly, property 1 implies that the incomplete-data likelihood is increased
with each successive iteration, property 2 says that the EM algorithm
can be used to find a zero of the incomplete-data score function, property 3
provides  us  with  a  way  of  evaluating  the  score  function  (see  Meilijson,  1989), 
and  finally  property  4  gives  us  an  expression  for  the  observed  information 
matrix  based  on  the  incomplete  data.  These  four  properties  of  the  EM 
algorithm  will  be  explored  in  detail  in  the  next  section  which  deals  with 
the  special  case  in  which  the  complete  data  have  distribution  in  the  regular 
exponential  family. 

4.2.2     Exponential  Family  Results 

The exponential families of distributions play an important role in statistical
inference. Many data generating mechanisms can be modeled assuming
that  the  underlying  distribution  is  a  member  of  the  regular  exponential  family. 
In  this  section  we  consider  properties  of  the  regular  exponential  family  that 
are  relevant  to  the  use  of  the  EM  algorithm.  Specifically,  we  will  make  use 
of  the  results  of  this  section,  which  are  due  primarily  to  Sundberg  (1974),  to 
justify  results  for  Poisson  loglinear  models  with  missing  data. 

Let the complete data vector X have density, with respect to some
measure, in the regular exponential family. That is, assume that

\[ f_X(x;\beta) = a(x)\exp(T'(x)\beta - c(\beta)), \tag{4.2.4} \]

where $T(x) = (T_1(x), T_2(x), \ldots, T_p(x))'$ and $\beta$ is a canonical parameter vector
of length p. Let $\mathcal{X} = \{x : f_X(x;\beta) > 0\}$.

Some well known properties of the regular exponential family include

1. $T(X)$ is sufficient for $\beta$,

2. $\partial c(\beta)/\partial\beta = E_\beta(T(X))$, and

3. $\partial^2 c(\beta)/\partial\beta\,\partial\beta' = \mathrm{var}_\beta(T(X))$.    (4.2.5)

These properties of (4.2.5) are shown in Lehmann (1983, pp. 29, 30). The
properties follow immediately upon repeated differentiation of $\int_{\mathcal{X}} f_X(x;\beta)\, d\nu(x)$
with respect to $\beta$. Lehmann (1983) showed that the derivative could be passed
through the integral.

Suppose that the incomplete data vector Y is a (many to one) function
of X, i.e. Y = Y(X). For notational convenience, we let t = T(x) and $I_R(x)$
represent the indicator of membership in $R = \{x : Y(x) = y\}$. It follows by
equation (4.2.2) that

\[ \begin{aligned}
f_{X|Y}(x;y,\beta) &= \frac{f_X(x;\beta)\, I_R(x)}{f_Y(y;\beta)}
 = \frac{a(x)\exp(t'\beta - c(\beta))\, I_R(x)}{\int_R a(x)\exp(t'\beta - c(\beta))\, d\nu(x)} \\
&= a(x)\exp(t'\beta - c^*(\beta;y))\, I_R(x) = a^*(x)\exp(t'\beta - c^*(\beta;y)),
\end{aligned} \tag{4.2.6} \]

where $a^*(x) = a(x)\, I_R(x)$ and $c^*(\beta;y) = \log\int_R a(x)\exp(t'\beta)\, d\nu(x)$. Hence the
conditional distribution of X given Y = y is also a member of the exponential
family (Sundberg, 1974). Again by properties of the exponential family we
have

1. $\partial c^*(\beta;y)/\partial\beta = E_\beta(T(X) \mid Y = y)$, and

2. $\partial^2 c^*(\beta;y)/\partial\beta\,\partial\beta' = \mathrm{var}_\beta(T(X) \mid Y = y)$.    (4.2.7)

Using (4.2.2) and (4.2.6) we can reexpress the density of Y as

\[ f_Y(y;\beta) = \frac{f_X(x;\beta)\, I_R(x)}{f_{X|Y}(x;y,\beta)}
 = \frac{a(x)\exp(t'\beta - c(\beta))\, I_R(x)}{a(x)\exp(t'\beta - c^*(\beta;y))\, I_R(x)}
 = \exp(c^*(\beta;y) - c(\beta)). \]

Our objective is to maximize $f_Y(y;\beta)$ with respect to $\beta$. Or, equivalently,
we are to maximize the log likelihood

\[ \ell_Y(\beta;y) = c^*(\beta;y) - c(\beta) \tag{4.2.8} \]

with respect to $\beta$.

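For well behaved $\ell_Y(\beta;y)$ we can find the value of $\beta$, say $\hat\beta$, that
maximizes it by solving the score equations

\[ S_Y(\beta;y) = \frac{\partial}{\partial\beta}\,\ell_Y(\beta;y) = \frac{\partial c^*(\beta;y)}{\partial\beta} - \frac{\partial c(\beta)}{\partial\beta} = 0. \tag{4.2.9} \]

Notice that by properties given in (4.2.5) and (4.2.7), this is equivalent to
solving the equation

\[ S_Y(\beta;y) = E_\beta(T(X) \mid Y = y) - E_\beta(T(X)) = 0. \tag{4.2.10} \]

There are many ways to solve (4.2.10). One possibility is to use the following
iterative scheme:

(1) Find $E_{\beta^{(\nu)}}(T(X) \mid Y = y)$

(2) Solve for $\beta^{(\nu+1)}$ in $E_{\beta^{(\nu+1)}}(T(X)) = E_{\beta^{(\nu)}}(T(X) \mid Y = y)$    (4.2.11)

(3) If $\|\beta^{(\nu)} - \beta^{(\nu+1)}\| > \mathrm{TOL}$ then replace $\beta^{(\nu)}$ by $\beta^{(\nu+1)}$ and go to (1).
Else stop.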

We  show  in  Appendix  B  that  the  iterative  scheme  (4.2.11)  is  simply  the  EM 
algorithm.  The  convergence  properties  are  discussed  in  Sundberg  (1976). 

One important note with regard to the EM algorithm (4.2.11) is that if
$\ell_Y(\beta;y)$ is not so well behaved, e.g. the score vector $S_Y(\beta;y)$ has multiple
roots some of which may be associated with a minimum, then the particular
solution $\hat\beta$, obtained via the EM algorithm, will be a local maximum likelihood
estimate. This follows since the likelihood increases monotonically with each
successive EM iteration.

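Upon convergence of the algorithm, we can use the negative Hessian
matrix evaluated at $\hat\beta$ to estimate the observed information matrix based on
the incomplete data. The negative Hessian is

\[ \begin{aligned}
I_Y(\beta;y) &= -\frac{\partial^2}{\partial\beta\,\partial\beta'}\,\ell_Y(\beta;y)
 = \frac{\partial^2 c(\beta)}{\partial\beta\,\partial\beta'} - \frac{\partial^2 c^*(\beta;y)}{\partial\beta\,\partial\beta'} \\
&= \mathrm{var}_\beta(T(X)) - \mathrm{var}_\beta(T(X) \mid Y = y) \\
&\overset{\mathrm{def}}{=} I_X(\beta) - I_{X|Y}(\beta;y).
\end{aligned} \tag{4.2.12} \]

This expression for the negative Hessian was noted by Sundberg (1974).
He referred to the matrix $I_{X|Y}$ as a measure of information loss. With
regard to lost information, let us suppose the observed data Y are such
that T(X) = g(Y). That is, the sufficient statistic for $\beta$ is a function of
Y. Intuitively we would expect no loss of information since we are able to
observe the sufficient statistic and hence we expect $I_{X|Y}$ to be identically the
zero matrix. In fact, this follows since T(x) is constant on $R = \{x : Y(x) = y\}$
whenever T(x) = g(y). Hence
$c^*(\beta;y) = \log\big[\exp(t'\beta)\int_R a(x)\, d\nu(x)\big] = t'\beta + \log\int_R a(x)\, d\nu(x)$,
which is linear in $\beta$. Thus

\[ I_{X|Y}(\beta;y) = \frac{\partial^2 c^*(\beta;y)}{\partial\beta\,\partial\beta'} = 0. \]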

In view of equation (4.2.9), instead of using the iterative scheme described
in (4.2.11), we could work directly with the incomplete data likelihood
$\ell_Y(\beta;y)$ and implement a Newton-Raphson or Fisher-scoring algorithm to find
a root to the nonlinear equation. The program NLIN described in Appendix B
can be used to this end. Notice that both $S_Y(\beta;y)$ and $I_Y(\beta;y)$ (or a numerical
approximation thereof) would need to be computed at each iteration.
Specifically, the iterative scheme can be written as

(1) Compute $\beta^{(\nu+1)} = \beta^{(\nu)} + \big(A_Y(\beta^{(\nu)};y)\big)^{-1} S_Y(\beta^{(\nu)};y)$

(2) If $\|\beta^{(\nu)} - \beta^{(\nu+1)}\| > \mathrm{TOL}$ then replace $\beta^{(\nu)}$ by $\beta^{(\nu+1)}$ and go to (1).
Else stop.    (4.2.13)

where $A_Y(\beta;y) = I_Y(\beta;y)$ if the Newton-Raphson method is used, $A_Y(\beta;y) =
E_\beta(I_Y(\beta;Y))$ if the Fisher-scoring method is used, or $A_Y(\beta;y)$ is a numerical
approximation to the observed or expected information. See section 4.5 for
details on the approximation method.
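
The update in (4.2.13) can be sketched as a short routine. This is only an illustration under stated assumptions, not the NLIN program: the function name `newton_scoring` and the toy data are hypothetical, and the score and information are passed in as callables so that either the observed or the expected information can play the role of $A_Y$.

```python
import numpy as np

def newton_scoring(beta0, score, info, tol=1e-8, max_iter=100):
    """Iterate beta <- beta + A(beta)^{-1} S(beta) until the step is below tol.

    `score(beta)` returns the score vector and `info(beta)` returns the matrix
    A (observed information for Newton-Raphson, expected for Fisher scoring).
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta_new = beta + np.linalg.solve(info(beta), score(beta))
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new
        beta = beta_new
    raise RuntimeError("iteration did not converge")

# Toy check: fully observed Poisson counts with an intercept-only model,
# for which the ML estimate is the log of the sample mean.
y = np.array([3.0, 5.0, 4.0])
Z = np.ones((3, 1))
score = lambda b: Z.T @ (y - np.exp(Z @ b))
info = lambda b: Z.T @ (np.exp(Z @ b)[:, None] * Z)
beta_hat = newton_scoring(np.array([0.0]), score, info)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; the iteration is otherwise the one written in (4.2.13).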

In section 4.5, we consider an iterative scheme that is a
modification/combination of the two schemes (4.2.11) and (4.2.13). The modified
algorithm  for  solving  (4.2.10)  exploits  the  virtues  of  both  these  iterative 
schemes. 

4.3     Loglinear  Model  Fitting  with  Incomplete  Data 

We  investigate  more  closely  the  special  case  of  incomplete  Poisson  data 
with  means  following  a  loglinear  model.  The  assumption  that  the  complete 
data  are  distributed  as  product  Poisson,  i.e.  the  components  are  independent 
Poisson  random  variables,  is  not  as  restrictive  as  it  seems.  We  use  results 
of  Birch  (1963)  and  Palmgren  (1981)  to  show  that  maximum  likelihood 
inferences  about  the  parameters  that  are  not  fixed  by  sample  design  are  the 
same  whether  the  data  are  product  Poisson  or  multinomial.  To  this  end, 
we derive an expression for the variance of the multinomial cell probability
estimates when the model parameters are estimated under the product
Poisson  assumption. 

Section  4.3.1  shows  that  the  EM  algorithm  takes  on  a  particularly  simple 
form  when  the  complete  data  are  assumed  to  be  product  Poisson  with  means 
following  a  loglinear  model.  In  section  4.3.2  we  derive  an  explicit  formula  for 
the  observed  information  matrix  that  is  based  on  the  observable  incomplete 
data. Section 4.3.3 discusses inferences for multinomial loglinear models.

4.3.1     The EM Algorithm for Poisson Loglinear Models

Let $X = (X_1, X_2, \ldots, X_n)'$ represent the "complete" data vector of cell
counts and suppose that

\[ X_i \sim \text{indep. Poisson}(\mu_i), \quad i = 1, 2, \ldots, n, \]

where $\mu_i = \mu_i(\beta)$ satisfies the loglinear model $\log\mu(\beta) = Z\beta$. Here Z is some
$n \times p$ full rank model matrix and $\beta$ is a $p \times 1$ parameter vector.

Suppose only certain disjoint sums of X are observable. Let $Y =
(Y_1, Y_2, \ldots, Y_m)' = LX$ denote the observable (or "incomplete") data. Here
L is an $m \times n$ matrix ($m < n$) that satisfies the following three properties:

(1)  Each  element  is  a  '0'  or  a  '1' 

(2)  There  is  at  most  one  '1'  per  column  (4.3.1) 

(3)  There  is  at  least  one  '1'  per  row 
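
The three conditions in (4.3.1) are easy to check mechanically. The sketch below is illustrative only: the helper name `satisfies_431` is hypothetical, and the example matrix assumes a latent class layout with 16 observable cells and a two-level latent factor whose index varies fastest within the complete-data vector.

```python
import numpy as np

def satisfies_431(L):
    """Return True if L meets conditions (1)-(3) of (4.3.1)."""
    L = np.asarray(L)
    zero_one = np.isin(L, (0, 1)).all()              # (1) entries are 0 or 1
    at_most_one_per_col = (L.sum(axis=0) <= 1).all() # (2) disjoint sums
    at_least_one_per_row = (L.sum(axis=1) >= 1).all()# (3) no empty row
    return bool(zero_one and at_most_one_per_col and at_least_one_per_row)

# Summing a 32-cell complete table over a two-level latent factor:
# row i of L adds complete-data cells 2i and 2i+1.
L_latent = np.kron(np.eye(16, dtype=int), np.ones((1, 2), dtype=int))
```

With this `L_latent`, `L_latent @ x` collapses a complete-data vector `x` of length 32 to the 16 observable counts.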

Properties (1) and (2) of (4.3.1) ensure that the components of Y
are independent Poisson random variables while property (3) precludes a
noninformative row of zeroes.

Denote realizations of X and Y by x and y. The objective of this section
is to find the maximum likelihood estimate of $\beta$, denoted by $\hat\beta$, based on the
observed data. Writing the density of the complete data X as

\[ f_X(x;\beta) = a(x)\cdot\exp(x'Z\beta - 1_n' e^{Z\beta}) \tag{4.3.2} \]

we see that $f_X$ has form (4.2.4) and that a sufficient statistic for $\beta$ is $Z'X$. It
follows by (4.2.8) that Y = LX has log likelihood of the form

\[ \ell_Y(\beta;y) = c^*(\beta;y) - c(\beta), \tag{4.3.3} \]

where $c^*$ and $c$ are functions defined in section 4.2.2. But, by properties of the
matrix L, we know that Y has a product Poisson distribution. Specifically,

\[ Y_i \sim \text{ind Poisson}(L_i'\mu), \quad i = 1, \ldots, m, \]

where $L_i'$ is the $i$th row of L and $\mu$ is the vector of complete data means. Since
the complete data means are a function of some model parameters through
$\log(\mu) = Z\beta$, we have that $L_i'\mu = L_i'\exp(Z\beta)$. It is important to note that
$\log(L_i'\mu)$ is generally a nonlinear function of $\beta$. For this reason, the model
fitting is somewhat more complicated.

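Using the fact that Y is product Poisson, we have that the log likelihood
of Y is

\[ \ell_Y(\beta;y) = \sum_{i=1}^{m} y_i \log(L_i'\exp(Z\beta)) - \sum_{i=1}^{m} L_i'\exp(Z\beta) + h(y), \tag{4.3.4} \]

where the function h(y) is independent of the parameter $\beta$. Now, we
differentiate equation (4.3.4) with respect to $\beta$ to obtain an expression for
the score vector. It is

\[ \begin{aligned}
S_Y(\beta;y) &= \sum_{i=1}^{m} \frac{y_i}{L_i'\mu}\, Z'D(\mu)L_i - \sum_{i=1}^{m} Z'D(\mu)L_i \\
&= Z'D(\mu)L'\Big(\frac{y}{L\mu}\Big) - Z'D(\mu)L'1_m \\
&= Z'\Big[\mu\cdot\Big(1_n - L'1_m + L'\Big(\frac{y}{L\mu}\Big)\Big)\Big] - Z'\mu,
\end{aligned} \tag{4.3.5} \]

where $\mu = \exp(Z\beta)$, $D(\mu)$ is the diagonal matrix with the elements of $\mu$ on
its diagonal, and in the last line '$\cdot$' and the fraction bar represent
componentwise operators. As shown in section 4.2.2, equation (4.2.10), the
score for the incomplete data can alternatively be expressed as

\[ \frac{\partial}{\partial\beta}\,\ell_Y(\beta;y) = E_\beta(Z'X \mid Y = y) - E_\beta(Z'X) \]

since $\partial c^*(\beta)/\partial\beta = E_\beta(Z'X \mid Y = y)$ and $\partial c(\beta)/\partial\beta = E_\beta(Z'X)$. Evidently,
since $E_\beta(Z'X) = Z'\mu$, it must be that

\[ E_\beta(Z'X \mid Y = y) = Z'\Big[\mu\cdot\Big(1_n - L'1_m + L'\Big(\frac{y}{L\mu}\Big)\Big)\Big]. \tag{4.3.6} \]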
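Therefore, the EM algorithm is simply

(1) Find $Z'\big[\mu(\beta^{(\nu)})\cdot\big(1_n - L'1_m + L'\big(\frac{y}{L\mu(\beta^{(\nu)})}\big)\big)\big]$

(2) Solve for $\beta^{(\nu+1)}$ in $Z'\mu(\beta^{(\nu+1)}) = Z'\big[\mu(\beta^{(\nu)})\cdot\big(1_n - L'1_m + L'\big(\frac{y}{L\mu(\beta^{(\nu)})}\big)\big)\big]$

(3) If $\|\beta^{(\nu)} - \beta^{(\nu+1)}\| > \mathrm{TOL}$ then replace $\beta^{(\nu)}$ by $\beta^{(\nu+1)}$ and go to (1).
Else stop.    (4.3.7)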
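
The three steps of (4.3.7) can be sketched as follows. This is a minimal illustration under stated assumptions and not the author's FORTRAN program: the function names, tolerances, and the small partially classified 2 x 2 example are hypothetical, and step (2) is solved here by Newton-Raphson on the complete-data Poisson likelihood equations, which is one of several possible choices.

```python
import numpy as np

def fit_poisson_loglinear(Z, x, beta, tol=1e-10, max_iter=100):
    """Solve Z' exp(Z beta) = Z' x for beta by Newton-Raphson (M-step)."""
    for _ in range(max_iter):
        mu = np.exp(Z @ beta)
        step = np.linalg.solve(Z.T @ (mu[:, None] * Z), Z.T @ (x - mu))
        beta = beta + step
        if np.linalg.norm(step) <= tol:
            break
    return beta

def em_loglin(Z, L, y, mu0, tol=1e-8, max_iter=5000):
    """EM iteration (4.3.7) for incomplete Poisson data Y = L X."""
    n, m = Z.shape[0], L.shape[0]
    # Starting beta: least-squares fit of log(mu0) on Z (an assumed choice).
    beta = np.linalg.lstsq(Z, np.log(mu0), rcond=None)[0]
    for iteration in range(1, max_iter + 1):
        mu = np.exp(Z @ beta)
        # Step (1): conditional expectation of the complete data given y.
        x_tilde = mu * (np.ones(n) - L.T @ np.ones(m) + L.T @ (y / (L @ mu)))
        # Step (2): complete-data likelihood equations with x replaced by x_tilde.
        beta_new = fit_poisson_loglinear(Z, x_tilde, beta)
        # Step (3): convergence check.
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new, iteration
        beta = beta_new
    raise RuntimeError("EM did not converge")

# Hypothetical 2 x 2 table under independence; the second row is observed
# only as a total, so Y = (X11, X12, X21 + X22).
Z = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
L = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
y = np.array([20., 10., 30.])
beta_hat, n_iter = em_loglin(Z, L, y, mu0=np.array([20., 10., 15., 15.]))
mu_hat = np.exp(Z @ beta_hat)
```

In this toy case the model has as many parameters as observed counts, so the fitted observable means `L @ mu_hat` should reproduce `y` at convergence.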

In practice, finding a reasonable starting value for $\beta$, say $\beta^{(0)}$, is very
difficult. However, in view of the first step of the EM algorithm, we need
only be concerned with an initial estimate of $\mu$. Notice that if $\mu^{(0)}$, the initial
guess for $\mu$, satisfies $L\mu^{(0)} = y$ then we have tacitly chosen an appropriate $\beta^{(0)}$
to start the algorithm. This is so since we can go to step (2) of the algorithm
and calculate $\beta^{(1)}$, the solution to the equation. In fact,

\[ \beta^{(0)} = (Z'Z)^{-1}Z'\log\mu^{(0)}. \]

Thus,  the  EM  algorithm  has  the  nice  feature  that,  not  only  is  it 
insensitive  to  starting  values,  but  also  reasonable  starting  values  are  simple 
to  find.  A  FORTRAN  program  'em.loglin'  has  been  written  to  actually 
implement  the  EM  algorithm  as  defined  in  (4.3.7). 
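
One simple way to build a starting vector with $L\mu^{(0)} = y$ is to split each observed count evenly over the cells it sums. The sketch below is an assumption-laden illustration, not the scheme used in 'em.loglin': the function names and the `fill` value for cells that enter no sum are hypothetical, and all observed counts are assumed positive so that the logarithm is defined.

```python
import numpy as np

def starting_means(L, y, fill=1.0):
    """Return mu0 with L @ mu0 == y by splitting each y_i evenly over its cells.

    Cells that enter no observed sum receive the arbitrary positive value `fill`.
    """
    L = np.asarray(L, dtype=float)
    y = np.asarray(y, dtype=float)
    cells_per_sum = L.sum(axis=1)        # number of cells in each observed sum
    mu0 = L.T @ (y / cells_per_sum)      # even split of y_i over its cells
    mu0[L.sum(axis=0) == 0] = fill       # unobserved cells
    return mu0

def starting_beta(Z, mu0):
    """Least-squares projection of log(mu0) onto the columns of Z."""
    return np.linalg.lstsq(Z, np.log(mu0), rcond=None)[0]

L_example = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
mu0 = starting_means(L_example, [20., 10., 30.])
```

For latent class models an exactly even split makes the latent classes indistinguishable at the start, so a slightly perturbed split would likely be needed there.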

4.3.2     Obtaining  the  Observed  Information  Matrix 

In the previous section we showed how one can obtain maximum likelihood
estimates of the loglinear model parameters using the EM algorithm. In
this section we address the major drawback of the EM algorithm: an estimate
of the precision of these ML estimates is not obtained as a by-product of the
algorithm. We derive an explicit formula for the observed information matrix
associated with the loglinear model parameters that is intuitively appealing
and simple to evaluate. Upon convergence of the EM algorithm the observed
information matrix is evaluated at the ML estimates and inverted. The
inverse information can be used as an estimate of precision (Agresti, 1990).
Notice that in this section we consider using the observed information rather
than the expected information. We follow the lead of Efron and Hinkley
(1978), who build a case for the preferred use of the observed information.
If desired, however, the expected information can easily be computed since
the observed information is shown to be a linear function of the incomplete
data.

Recall the setup in the previous section. Only disjoint sums of a complete
data vector X, which is product Poisson, are observable. The complete data
means are assumed to follow a loglinear model of the form $\log\mu = Z\beta$. By
expression (4.2.12) of section 4.2.2, we see that the observed information
matrix based on the incomplete data has form

\[ \begin{aligned}
I_Y(\beta;y) &= \mathrm{var}_\beta(Z'X) - \mathrm{var}_\beta(Z'X \mid Y = y) \\
&= I_X(\beta) - (\text{Adjustment Matrix}).
\end{aligned} \]

This expression is intuitively appealing since $\mathrm{var}_\beta(Z'X) = Z'D(\mu)Z$ is the
expected (and observed) information for $\beta$ treating the complete data, X, as
if it were observed, while $\mathrm{var}_\beta(Z'X \mid Y = y)$ is an adjustment that is necessary
because we do not actually observe X but only LX = Y. The amount of
information lost by observing only Y is determined by the conditional variance
of the sufficient statistic $Z'X$ given LX = y.
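
As a sketch of how this conditional variance can be worked out, assume the standard fact that independent Poisson counts, conditioned on their sum, are multinomial. Write $r(a)$ for the row of L whose sum contains cell $a$, with $r(a) = 0$ when cell $a$ enters no sum. Cells sharing a row are then conditionally multinomial with index $y_{r(a)}$ and probabilities $\mu_a / L_{r(a)}'\mu$, cells with $r(a) = 0$ remain Poisson, and distinct groups are independent. Under these assumptions,

\[ \operatorname{cov}(X_a, X_b \mid LX = y) = I(r(a) = 0)\, I(a = b)\, \mu_a
 + I(r(a) = r(b) > 0)\; y_{r(a)} \left[ I(a = b)\, \frac{\mu_a}{L_{r(a)}'\mu} - \frac{\mu_a \mu_b}{(L_{r(a)}'\mu)^2} \right], \]

or, in matrix form,

\[ \operatorname{var}(X \mid LX = y) = D\!\left(\mu\cdot\left(1_n - L'1_m + L'\left(\frac{y}{L\mu}\right)\right)\right) - D(\mu)\, L'\, D\!\left(\frac{y}{(L\mu)^2}\right) L\, D(\mu), \]

so that the adjustment matrix is $Z'\operatorname{var}(X \mid LX = y)\,Z$. This is a derivation sketch under the stated assumptions and is not necessarily the form in which the result was originally written.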

At this point, one could derive a formula for the adjustment matrix as
in a technical report by the author. The gist of the argument was that the
distribution of $X \mid Y = y$ has a simple form when Y represents disjoint sums of
the independent Poisson random variables X and so the conditional variance
of X (or $Z'X$) given Y = y can easily be computed. A main result of that
technical report was a closed-form expression for

\[ \operatorname{cov}(X_a, X_b \mid LX = y), \tag{4.3.8} \]

written in terms of $I(\cdot)$, the indicator function, and $r(j)$, which is defined as follows:

\[ r(j) = \begin{cases}
\text{row number in which '1' occurs,} & \text{for the $j$th column of } L, \\
0, & \text{if a '1' does not occur in column $j$ of } L.
\end{cases} \]

In this dissertation we will take a different approach. The explicit
form of the score statistic for Y was derived in equation (4.3.5). Since the
observed information is nothing but the negative Hessian of the log likelihood,
we can obtain an explicit formula for the observed information by simply
differentiating the negative of the score function with respect to $\beta'$. The
appendix shows how one arrives at

\[ \begin{aligned}
I_Y(\beta;y) &= -\frac{\partial}{\partial\beta'}\, S_Y(\beta;y) \\
&= Z'D(\mu)\,L'\,D\!\left(\frac{y}{(L\mu)^2}\right) L\,D(\mu)Z \;-\; Z'D\!\left(L'\!\left(\frac{y}{L\mu} - 1_m\right)\right) D(\mu)Z,
\end{aligned} \tag{4.3.9} \]

where the fraction bar in the last line represents componentwise division.

Notice that the expected information matrix has a particularly simple
form, viz.

\[ \begin{aligned}
E_\beta(I_Y(\beta;Y)) &= Z'D(\mu)\,L'\,D\!\left(\frac{L\mu}{(L\mu)^2}\right) L\,D(\mu)Z \\
&= Z'D(\mu)\,L'\,D^{-1}(L\mu)\,L\,D(\mu)Z.
\end{aligned} \tag{4.3.10} \]
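
The observed and expected information can be evaluated directly from Z, L, y, and $\beta$. The sketch below is illustrative and rests on assumptions: the function names are hypothetical, the score is taken to be $Z'[\mu\cdot L'(y/L\mu - 1_m)]$ with $\mu = \exp(Z\beta)$, and the observed information is taken to be the negative derivative of that score, which the usage example checks by finite differences.

```python
import numpy as np

def score(beta, Z, L, y):
    """Score for Y = L X: Z' [ mu * L'(y / (L mu) - 1) ]."""
    mu = np.exp(Z @ beta)
    return Z.T @ (mu * (L.T @ (y / (L @ mu) - 1.0)))

def observed_info(beta, Z, L, y):
    """Negative Hessian of the incomplete-data log likelihood."""
    mu = np.exp(Z @ beta)
    eta = L @ mu                                   # means of the observed sums
    DmuZ = mu[:, None] * Z                         # D(mu) Z
    first = DmuZ.T @ L.T @ ((y / eta**2)[:, None] * (L @ DmuZ))
    second = Z.T @ ((L.T @ (y / eta - 1.0))[:, None] * DmuZ)
    return first - second

def expected_info(beta, Z, L):
    """Expected information: Z' D(mu) L' D(L mu)^{-1} L D(mu) Z."""
    mu = np.exp(Z @ beta)
    LDZ = L @ (mu[:, None] * Z)                    # L D(mu) Z
    return LDZ.T @ (LDZ / (L @ mu)[:, None])

# Hypothetical example: 2 x 2 independence model, second row seen as a total.
Z = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
L = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]])
y = np.array([20., 10., 30.])
beta = np.array([2.5, 0.2, -0.4])
h = 1e-6
jacobian = np.column_stack([
    (score(beta + h * e, Z, L, y) - score(beta - h * e, Z, L, y)) / (2 * h)
    for e in np.eye(3)
])
info_obs = observed_info(beta, Z, L, y)
```

Inverting `observed_info` at the converged EM estimate would give the precision estimate discussed in this section; replacing `y` by its mean `L @ mu` in the observed information reproduces the expected information.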


Using either of the results in (4.3.8) or (4.3.9), we derive an explicit form
for the observed information matrix for several examples.

Example 1: Missing Components. When certain components are
unobservable, L will be an identity matrix with rows missing. It follows
that the observed information matrix is

\[ I_Y(\beta;y) = Z'D(\mu)Z - Z'D(M\mu)Z, \]

where M is a diagonal matrix with $j$th diagonal element $(M)_{jj} = 1 - I(r(j) > 0)$.

Example 2: Latent Class Models. Suppose that counts resulting from
a cross-classification on several factors are observable and that classification
on an additional K-level latent factor is unobservable. We let the subscript i
represent a compound subscript identifying classification on observable factors
while the subscript j indexes the K latent classes. Denote the complete
data vector of cell counts by $X = (X_{11}, \ldots, X_{1K}, \ldots, X_{m1}, \ldots, X_{mK})' =
\{X_{ij}\}$ and the incomplete data by $Y = \{X_{i+}\}$. Notice that Y = LX
where, with the latent index varying fastest, $L = I_m \otimes 1_K'$. One possible latent
class model assumes the means
of the unobservable complete data, say $\mu_{ij}$, follow a loglinear model that
implies conditional independence of observed factors given the latent factor
classification (Haberman, 1979). It follows that the observed information
matrix is

\[ I_Y(\beta;y) = Z'D(\mu)Z - Z'\Big(\bigoplus_{i=1}^{m} V_i\Big)Z, \tag{4.3.11} \]

where each $V_i$ is the covariance of a $K \times 1$ multinomial vector with index
$y_i = X_{i+}$ and cell probabilities $\{\mu_{ij} / \sum_{j=1}^{K}\mu_{ij},\; j = 1, \ldots, K\}$.
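
The block-diagonal adjustment in Example 2 can be assembled directly. The following is a small sketch under stated assumptions: the function names are hypothetical, the complete-data vector is assumed to be ordered with the latent index varying fastest, and the tiny model (one dichotomous observed factor, two latent classes) is invented purely for illustration.

```python
import numpy as np

def latent_class_adjustment(mu, y, K):
    """Direct sum of multinomial covariances V_i = y_i (diag(p_i) - p_i p_i')."""
    m = y.size
    adjustment = np.zeros((m * K, m * K))
    for i in range(m):
        block = slice(i * K, (i + 1) * K)
        p = mu[block] / mu[block].sum()           # conditional class probabilities
        adjustment[block, block] = y[i] * (np.diag(p) - np.outer(p, p))
    return adjustment

def latent_class_info(Z, mu, y, K):
    """Observed information: Z' D(mu) Z - Z' (direct sum of V_i) Z."""
    return Z.T @ (mu[:, None] * Z) - Z.T @ latent_class_adjustment(mu, y, K) @ Z

# Hypothetical model: intercept, observed-factor effect, latent-class effect.
Z = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
mu = np.exp(Z @ np.array([1.0, 0.3, -0.5]))
y = np.array([12., 8.])
info = latent_class_info(Z, mu, y, K=2)
```

Because every complete-data cell enters exactly one observed sum here, this expression should agree with the general negative-Hessian formula evaluated with $L = I_m \otimes 1_K'$.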

Example 3: Partially Classified Data Models. Consider the two factor
nonignorable nonresponse model with one supplemental margin (Little &
Rubin, section 11.6, 1987). The complete data X are counts resulting from
a cross-classification on two factors $F_1$ and $F_2$, along with a dichotomous
nonresponse indicator R. Suppose the $F_1$ classification is always known and
that R indicates whether or not the $F_2$ classification is known. To make
inferences about the classification probabilities and missing data assumptions,
Little & Rubin assume the complete data means follow a loglinear model.
Variance estimates of the loglinear parameters are easily derived since the
observed data have form Y = LX and L satisfies (4.3.1).

4.3.3     Inferences  for  Multinomial  Loglinear  Models 

Previously, we assumed that the complete data were distributed as
product Poisson, i.e. the complete data components are independent Poisson
random variables. However, the sample size is often fixed by design so that
the distribution of the complete data vector may really be multinomial. This
follows since a product Poisson vector given the total is multinomial. Since
the total sample size is considered a random variable when the product
Poisson assumption is used, the assumption seems to be unreasonable.
Fortunately, Birch (1963) and Palmgren (1981) showed that maximum
likelihood inferences about all of the loglinear parameters that are not fixed by
design are the same whether one assumes the distribution is product Poisson
or multinomial. Therefore, it is general practice to assume the data are
product Poisson since the Poisson distribution is in the regular exponential
family and has an unconstrained canonical parameter. The Poisson loglinear
model is an example of a generalized linear model (McCullagh and Nelder,
1989) which makes it simple to work with.

In this section we discuss making inferences about loglinear parameters
when the sampling design is such that the total sample size is considered
fixed but the data are not completely observed, i.e. there are missing data. It
is not obvious that the results of Birch extend to the case of incomplete data.
Therefore, we provide a detailed discussion of the extension to the missing
data case.

The Setup. In the following argument we assume that the matrix L is such
that each column has at least one '1' in it. This requirement results in the
incomplete data Y = LX having the same sum total as the complete data,
i.e. $1_m'Y = 1_m'LX = 1_n'X = N$. We also require the loglinear model to include
an intercept term. This intercept term will be the parameter that is fixed by
design, since the total sample size N will be considered fixed.

Full-Multinomial Sampling. Suppose that the complete data vector X has
a multinomial distribution, i.e.

$$X = (X_1, \ldots, X_n)' \sim \mathrm{Mult}(N, \pi(\theta)),$$

where $N = 1_n'X$ is the fixed total sample size and $\pi(\theta) = (\pi_1(\theta), \ldots, \pi_n(\theta))'$
represents the vector of cell probabilities that satisfy $\sum_{i=1}^n \pi_i(\theta) = 1$. Since N
is considered fixed, it makes sense to write the cell means as $\mu_i(\theta) = N\pi_i(\theta)$
so that $\sum_{i=1}^n \mu_i(\theta) = N$. Assume also that the cell means $\{\mu_i(\theta)\}$ follow the
loglinear model

$$\log\mu_i(\theta) = \alpha + x_i'\beta, \qquad i = 1, \ldots, n,$$

where $x_i$ is a $p \times 1$ vector and $\theta = (\alpha, \beta')'$ contains the so-called loglinear
parameters.

Further, suppose that only $Y = (Y_1, \ldots, Y_m)' = LX$ is observable. The
matrix L, which is of dimension $m \times n$ ($m < n$), will be required to satisfy the
3 conditions of (4.3.1) as well as

(4)    L has at least one '1' in each column.

It follows that

$$Y = (Y_1, \ldots, Y_m)' \sim \mathrm{Mult}(N, L\pi(\theta)),$$

where $L\pi(\theta) = (L_1'\pi(\theta), \ldots, L_m'\pi(\theta))'$. Again expressing the cell means as
$\eta(\theta) = L\mu(\theta) = NL\pi(\theta)$, we have that the incomplete data cell means satisfy

$$\eta_i(\theta) = L_i'\mu(\theta) = L_i'\exp(\alpha 1_n + X\beta) = \exp(\alpha)\,L_i'\exp(X\beta).$$

Also, since there is a constraint on the $\mu_i(\theta)$ there is a constraint on the $\eta_i(\theta)$.
In fact, the $\eta_i$ satisfy $\sum_{i=1}^m \eta_i(\theta) = (\sum_{i=1}^m L_i')\mu(\theta) = 1_n'\mu(\theta) = N$. Also, the log
means satisfy $\log\eta_i(\theta) = \log(\exp(\alpha)L_i'\exp(X\beta)) = \alpha + \log(L_i'\exp(X\beta))$.

Denote the model parameter space for the multinomial scenario by $\Theta_M$
and notice that

$$\Theta_M = \{\theta = (\alpha, \beta')' : \sum_i \eta_i(\theta) = N\}.$$

Evidently the set $\Theta_M$ is constrained, and so $\Theta_M$ is not equal to the $(p+1)$-
dimensional real space.

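Consider the one-to-one transformation $\theta \mapsto \theta^* = (\tau, \beta')'$, where
$\tau = \sum_{i=1}^m \eta_i(\theta)$. It follows that under this new parameterization the $\eta_i$ satisfy

$$\log\eta_i(\theta^*) = \log\tau - \log\Big(\sum_{j=1}^m L_j'\exp(X\beta)\Big) + \log L_i'\exp(X\beta),$$

since

$$\tau = \sum_{i=1}^m \eta_i = \sum_{i=1}^m \exp(\alpha)\,L_i'\exp(X\beta)
\quad\Longrightarrow\quad
\alpha = \log\tau - \log\Big(\sum_{i=1}^m L_i'\exp(X\beta)\Big).$$

We will call the new parameter space $\Theta_M^*$ and note that it is

$$\Theta_M^* = \{\theta^* = (\tau, \beta')' : \tau = N,\ \beta \in R^p\}.$$

The incomplete-data log likelihood under the (M)ultinomial assumption can
be written in terms of this new parameterization as

$$\begin{aligned}
\ell_Y^{(M)}(\theta^*; y) &= \sum_i y_i \log\eta_i(\theta^*) - N\log N \\
&= \sum_i y_i\Big(\log\tau - \log\big(\sum_j L_j'\exp(X\beta)\big) + \log L_i'\exp(X\beta)\Big) - N\log N \\
&= N\log\tau - N\log N + \sum_i y_i \log L_i'\exp(X\beta) - N\log\big(\sum_i L_i'\exp(X\beta)\big) \\
&= \sum_i y_i \log L_i'\exp(X\beta) - N\log\big(\sum_i L_i'\exp(X\beta)\big) = \ell_2(\beta),
\end{aligned} \qquad (4.3.12)$$

since $\tau = N$ for all $\theta^* \in \Theta_M^*$. Therefore, the incomplete-data multinomial log
likelihood is independent of $\tau$. Also, since the parameter $\beta$ is free of
constraints, we can maximize $\ell_Y^{(M)}(\theta^*; y)$ with respect to $\theta^*$ by simply setting
$\tau = N$ and maximizing the unconstrained function $\ell_2(\beta)$ with respect to $\beta$.
In this context we refer to $\alpha$ as being fixed by sample design since it is a
function of the other parameters $\beta$ and the fixed sample size N.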

Product-Poisson Sampling. In contrast to the first sampling scheme, the
total sample size is not considered fixed. Assume that the complete data
$X = (X_1, \ldots, X_n)'$ are distributed as product Poisson, i.e.

$$X_i \sim \text{ind Poisson}(\mu_i(\theta)), \qquad i = 1, \ldots, n,$$

where the parameter $\theta$ is unconstrained and the means satisfy

$$\log\mu_i(\theta) = \alpha + x_i'\beta, \qquad i = 1, \ldots, n.$$

Again, we assume that the complete data are not observable and that
we are only able to see $Y = (Y_1, \ldots, Y_m)' = LX$, with L satisfying the same
four properties that it did in the multinomial setup. The vector Y is then
distributed as product Poisson. Specifically,

$$Y_i \sim \text{ind Poisson}(L_i'\mu(\theta) = \eta_i(\theta)), \qquad i = 1, \ldots, m.$$

The cell means $\eta_i(\theta)$ satisfy the model

$$\log\eta_i(\theta) = \alpha + \log(L_i'\exp(X\beta)), \qquad i = 1, \ldots, m,$$

or, using the same reparameterization ($\theta \mapsto \theta^*$) as above,

$$\log\eta_i(\theta^*) = \log\tau - \log\Big(\sum_j L_j'\exp(X\beta)\Big) + \log(L_i'\exp(X\beta)).$$

We will denote the model parameter space for the Poisson sampling case by
$\Theta_P = \{\theta^* = (\tau, \beta')' : \tau \in R^+,\ \beta \in R^p\}$, where the symbol $R^+$ represents the
set of positive real numbers. It is important to note that $\Theta_M^* \neq \Theta_P$, since $\Theta_M^*$
constrains $\tau$ to equal N while $\Theta_P$ requires $\tau$ only to be positive.

The incomplete-data Poisson log likelihood can be written as

$$\begin{aligned}
\ell_Y^{(P)}(\theta^*; y) &= \sum_i y_i \log\eta_i(\theta^*) - \sum_i \eta_i(\theta^*) \\
&= \sum_i y_i\Big(\log\tau - \log\big(\sum_j L_j'\exp(X\beta)\big) + \log(L_i'\exp(X\beta))\Big) - \sum_i \eta_i(\theta^*) \\
&= y_+\log\tau - \tau + \sum_i y_i \log(L_i'\exp(X\beta)) - y_+\log\big(\sum_i L_i'\exp(X\beta)\big) \\
&= \ell_1(\tau) + \ell_2(\beta),
\end{aligned} \qquad (4.3.13)$$

where $\ell_2(\beta)$ is defined to be the multinomial log likelihood in (4.3.12) and
$\ell_1(\tau)$ is the log likelihood for the Poisson random variable $Y_+$, which is the
total sample size N. Since $\beta$ is unconstrained for both sampling schemes,
we can find the ML estimates by differentiating (4.3.12) and (4.3.13) with
respect to $\beta$ and finding the roots of these score functions. But the two score
functions are identical, implying that the maximum likelihood estimates of $\beta$
are the same for both sampling schemes. That is, if we let $\hat\beta^{(M)}$ and $\hat\beta^{(P)}$
denote the ML estimates of $\beta$ under the multinomial and Poisson sampling
schemes, respectively, we have shown that $\hat\beta^{(M)} = \hat\beta^{(P)}$. Also, by (4.3.12) and
(4.3.13), we see that upon differentiating a second time

$$\frac{\partial^2 \ell_Y^{(M)}(\theta^*; y)}{\partial\beta\,\partial\beta'} = \frac{\partial^2 \ell_Y^{(P)}(\theta^*; y)}{\partial\beta\,\partial\beta'} = \frac{\partial^2 \ell_2(\beta)}{\partial\beta\,\partial\beta'},$$

so that the portion of the information matrix that pertains to $\beta$ is the same
for both sampling schemes. Further, equation (4.3.13) shows that the log
likelihood for incomplete Poisson components can be expressed as a sum
of two log likelihoods with no parameters in common. Thus, the parameters are
orthogonal in that the information matrix is block diagonal, i.e. the parameter
estimates are asymptotically uncorrelated. The inverse of a block diagonal
matrix is simply the block diagonal matrix of the individual inverses. Hence,
the estimated variance of the ML estimates of $\beta$ is the same for either sampling
scheme.
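
The decomposition in (4.3.13) can be checked numerically. The following is a minimal sketch on a hypothetical toy problem (the matrices `X` and `L`, the counts `y`, and the parameter values are all made up for illustration); it evaluates the incomplete-data Poisson kernel directly and compares it with the sum of the two pieces described above.

```python
import math

# Hypothetical toy problem: n = 4 complete cells, m = 2 observed sums, p = 1.
X = [[0.0], [1.0], [2.0], [3.0]]   # n x p design matrix without the intercept column
L = [[1, 1, 0, 0], [0, 0, 1, 1]]   # every column contains exactly one 1
y = [7, 5]                         # observed counts, so y_+ = 12


def collapsed_terms(beta):
    """Return the m values L_i' exp(X beta)."""
    e = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
    return [sum(l * ek for l, ek in zip(Li, e)) for Li in L]


def ell2(beta):
    """Multinomial piece: sum_i y_i log(c_i) - y_+ log(sum_i c_i), c_i = L_i' exp(X beta)."""
    c = collapsed_terms(beta)
    return sum(yi * math.log(ci) for yi, ci in zip(y, c)) - sum(y) * math.log(sum(c))


def ell1(tau):
    """Poisson piece for the total: y_+ log(tau) - tau."""
    return sum(y) * math.log(tau) - tau


def poisson_loglik(alpha, beta):
    """Incomplete-data Poisson kernel: sum_i (y_i log eta_i - eta_i)."""
    eta = [math.exp(alpha) * ci for ci in collapsed_terms(beta)]
    return sum(yi * math.log(ei) - ei for yi, ei in zip(y, eta))


alpha, beta = 0.3, [-0.2]
tau = math.exp(alpha) * sum(collapsed_terms(beta))   # tau = sum_i eta_i
gap = poisson_loglik(alpha, beta) - (ell1(tau) + ell2(beta))
print(abs(gap) < 1e-9)
```

Because only `ell2` involves `beta`, any optimizer applied to either sampling model sees the same function of `beta`, which is the point of the argument above.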

Cell Mean Inference. Notice that not only is $\hat\beta^{(M)} = \hat\beta^{(P)}$ but also
$\hat\tau^{(M)} = \hat\tau^{(P)} = N$. This follows since, in the multinomial case, $\tau$ is necessarily
equal to the total sample size N, while in the Poisson case, $\ell_1(\tau)$ is simply
the log likelihood of the random variable $Y_+$, which is Poisson with mean
$\tau$, implying that the ML estimate is $\hat\tau^{(P)} = Y_+ = N$. However, we must
acknowledge the fact that the asymptotic variance of $\hat\tau$ under the Poisson
assumption is approximately N (it is $\mathrm{var}(Y_+)$, where $Y_+ \sim \mathrm{Poisson}(\tau)$),
while the variance of $\hat\tau$ under the multinomial assumption is zero ($\mathrm{var}(N) = 0$
since N is nonstochastic). This is important because inferences about cell
means (or cell probabilities) involve all of the loglinear parameters, even $\tau$.
Thus the variance of the cell mean estimates will depend upon which sampling
scheme is used.

Briefly, using the EM algorithm, we can find the observed information for
the loglinear parameters $(\alpha, \beta')'$ based on the assumption that the complete
data are product Poisson. The complete data means $\mu_i$ are assumed to follow
the loglinear model

$$\log\mu_i = \alpha + x_i'\beta, \qquad i = 1, \ldots, n.$$

If the sampling design is such that $X_+ = N$, the total sample size, is fixed
so that $X \sim \mathrm{Mult}(N, \pi(\alpha, \beta))$, then the parameter $\alpha$ is 'fixed by design'.
Actually, upon reparameterization, we see that $\beta$ is free of constraints but
that $\alpha = \alpha(\beta, N)$, i.e. $\alpha$ is a function of $\beta$ and N. In fact,

$$\alpha = \log\Big(\sum_{i=1}^n \mu_i\Big) - \log\Big(\sum_{i=1}^n \exp(x_i'\beta)\Big)
= \log N - \log\Big(\sum_{i=1}^n \exp(x_i'\beta)\Big). \qquad (4.3.14)$$

Our objective is to find an estimate of the variance of the cell mean estimates $\hat\mu$
under the multinomial assumption. The calculation of this variance estimate
is complicated somewhat since the variance estimate of $\hat\alpha$ is different for the
two sampling schemes. It is a simple application of the delta method to find
the variance of $\hat\mu$ under the Poisson assumption, since $\mu = \exp(\alpha 1_n + X\beta)$.
This follows since we have found the information for $(\alpha, \beta)$, and hence the
estimated variance-covariance matrix of $(\hat\alpha, \hat\beta)$, based on the assumption that
the complete data are product Poisson and that the incomplete data are of
the form Y = LX with L satisfying the same four properties as above.

Since, upon convergence of the EM algorithm, we compute the variance-
covariance matrix of $(\hat\alpha, \hat\beta)$ under the product Poisson assumption only, we
must find a way to rewrite $\mu$ in terms of $\beta$ and N only. But by (4.3.14) we
have the relationship

$$\alpha = \log N - \log\Big(\sum_{i=1}^n \exp(x_i'\beta)\Big),$$

so that

$$\begin{aligned}
\mu &= \exp(\alpha 1_n + X\beta) \\
&= \exp\Big(\big(\log N - \log\big(\textstyle\sum_{i=1}^n \exp(x_i'\beta)\big)\big)1_n + X\beta\Big) \\
&= N\Big(\frac{\exp(X\beta)}{\sum_{i=1}^n \exp(x_i'\beta)}\Big) = N\Big(\frac{\exp(X\beta)}{1_n'\exp(X\beta)}\Big).
\end{aligned} \qquad (4.3.15)$$

Now since the information for $\beta$ is the same under both sampling
schemes, we can find an estimate of the variance of $\hat\mu$ assuming the complete
data are multinomially distributed. We will actually find the variance of $\hat\pi$,
which is nothing but $\hat\pi = \exp(X\hat\beta)/(1_n'\exp(X\hat\beta))$, via the delta method.

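Delta Method. Since the ML estimate $\hat\beta$ is consistent, a first order
approximation to $\hat\pi$ can be found by using a Taylor expansion about the
true parameter value $\beta_0$, viz.

$$\hat\pi = \pi(\hat\beta) \approx \pi(\beta_0) + \frac{\partial\pi(\beta_0)}{\partial\beta'}(\hat\beta - \beta_0).$$

Thus, the variance of $\hat\pi$ is approximately

$$\mathrm{var}(\hat\pi) \approx \mathrm{var}\Big(\pi(\beta_0) + \frac{\partial\pi(\beta_0)}{\partial\beta'}(\hat\beta - \beta_0)\Big)
= \frac{\partial\pi(\beta_0)}{\partial\beta'}\,\mathrm{var}(\hat\beta)\,\Big(\frac{\partial\pi(\beta_0)}{\partial\beta'}\Big)',$$

where $\mathrm{var}(\hat\beta)$ is that portion of the variance-covariance matrix of $(\hat\alpha, \hat\beta)$
pertaining to $\hat\beta$. Recall that it was shown above that this portion is the
same for both sampling schemes.

It is shown in Appendix B that

$$\frac{\partial\pi}{\partial\beta'} = [D(\pi) - \pi\pi']X, \qquad (4.3.16)$$

where X = Z[,-1]. That is, X is the design matrix with the first column
deleted. Hence, the variance of $\hat\pi$ under the multinomial assumption can be
estimated by

$$\widehat{\mathrm{var}}_{\mathrm{Mult}}(\hat\pi) = [D(\hat\pi) - \hat\pi\hat\pi']\,\big(X\,\widehat{\mathrm{var}}(\hat\beta)\,X'\big)\,[D(\hat\pi) - \hat\pi\hat\pi']. \qquad (4.3.17)$$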
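
As an illustration of (4.3.16) and (4.3.17), the sketch below builds the Jacobian $[D(\pi) - \pi\pi']X$ for a small hypothetical design, compares one column with a numerical derivative, and forms the delta-method covariance. The matrix `X`, the value `beta`, and the covariance `var_beta` are made-up inputs, not quantities from this chapter.

```python
import math

X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5]]   # hypothetical n x p matrix
beta = [0.4, -0.3]                                      # hypothetical estimate
var_beta = [[0.04, 0.01], [0.01, 0.09]]                 # hypothetical var(beta-hat)


def probs(b):
    """pi = exp(X b) / (1' exp(X b))."""
    e = [math.exp(sum(x * bj for x, bj in zip(row, b))) for row in X]
    s = sum(e)
    return [ek / s for ek in e]


def jacobian(b):
    """[D(pi) - pi pi'] X; row i equals pi_i * (x_i - sum_k pi_k x_k)'."""
    pi = probs(b)
    n, p = len(X), len(b)
    xbar = [sum(pi[k] * X[k][j] for k in range(n)) for j in range(p)]
    return [[pi[i] * (X[i][j] - xbar[j]) for j in range(p)] for i in range(n)]


def var_pi(b, vb):
    """Delta-method covariance J var(beta-hat) J' as in (4.3.17)."""
    J = jacobian(b)
    n, p = len(J), len(b)
    JV = [[sum(J[i][k] * vb[k][j] for k in range(p)) for j in range(p)] for i in range(n)]
    return [[sum(JV[i][k] * J[j][k] for k in range(p)) for j in range(n)] for i in range(n)]


J = jacobian(beta)
V = var_pi(beta, var_beta)
total_var = sum(sum(row) for row in V)   # estimated variance of 1' pi-hat

# Central-difference check of the first Jacobian column.
h = 1e-6
up, down = probs([beta[0] + h, beta[1]]), probs([beta[0] - h, beta[1]])
numeric_col0 = [(u - d) / (2 * h) for u, d in zip(up, down)]
max_err = max(abs(a - row[0]) for a, row in zip(numeric_col0, J))
```

The columns of the Jacobian sum to zero, so `total_var` is zero up to rounding; this reflects the fact that the estimated probabilities always sum to one under the multinomial assumption.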

4.4     Latent  Class  Model  Fitting — An  Application 

To further illustrate the utility of the above results, we explore the fitting
of loglinear latent class models. For an exposition of latent class analysis,
see Haberman (1979).

Suppose we can observe (manifest) factors $A_1, A_2, \ldots, A_p$ with $I_1, I_2, \ldots, I_p$
levels, respectively, while a latent factor W with K levels is not observable.
Consider the set of cells, $C = \{(1,1,\ldots,1,1), (1,1,\ldots,1,2), \ldots, (I_1,\ldots,I_p)\}$,
resulting from a cross classification on factors $A_1, \ldots, A_p$. Listing the elements
of C in lexicographical order, we denote the first cell by 1, the second by 2, and
so on to m, where $m = \prod_{i=1}^p I_i$. With this representation the complete data
(the Km cell counts) are $X = (X_{11}, \ldots, X_{1K}, \ldots, X_{m1}, \ldots, X_{mK})'$. The
observed data, Y, are the marginal counts collapsed over latent factor W.
Here $Y = LX = (X_{1+}, \ldots, X_{m+})'$, where $L = 1_K' \otimes I_m$.
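
A small sketch may help fix the ordering of the complete data. Under the ordering stated above (latent level varying fastest within each manifest cell), row i of the collapsing matrix has ones in the K positions belonging to cell i. The sizes and counts below are hypothetical, and the helper names are not from the text.

```python
def collapse_matrix(m, K):
    """m x (m*K) 0/1 matrix whose row i sums the K latent levels of manifest cell i."""
    return [[1 if col // K == row else 0 for col in range(m * K)] for row in range(m)]


def collapse(L, x):
    """Compute y = L x."""
    return [sum(l * xk for l, xk in zip(row, x)) for row in L]


m, K = 3, 2                # hypothetical: 3 manifest cells, 2 latent levels
x = [4, 1, 0, 6, 2, 2]     # (x11, x12, x21, x22, x31, x32), hypothetical counts
L = collapse_matrix(m, K)
y = collapse(L, x)         # (x1+, x2+, x3+)
print(y)                   # [5, 6, 4]
```

Every column of this matrix contains exactly one 1, so the observed and complete data have the same total.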

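We initially assume that X is composed of independent Poissons with
means following the loglinear model.

We can use the EM algorithm of (4.3.7) to derive $\hat\theta = (\hat\alpha, \hat\beta')'$ and equation
(4.3.11) to obtain an estimate of its variance. From (4.3.8) the adjustment
matrix is $Z'\,\mathrm{var}(X \mid LX = y)\,Z$ with

$$\mathrm{var}(X \mid LX = y) = \oplus_{i=1}^m V_i =
\begin{pmatrix}
V_1 & 0 & 0 & \cdots & 0 \\
0 & V_2 & 0 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & V_m
\end{pmatrix},$$

where

$$V_i = y_i
\begin{pmatrix}
\frac{\mu_{i1}}{\mu_{i+}}\big(1 - \frac{\mu_{i1}}{\mu_{i+}}\big) & -\frac{\mu_{i1}\mu_{i2}}{\mu_{i+}^2} & \cdots & -\frac{\mu_{i1}\mu_{iK}}{\mu_{i+}^2} \\
-\frac{\mu_{i2}\mu_{i1}}{\mu_{i+}^2} & \frac{\mu_{i2}}{\mu_{i+}}\big(1 - \frac{\mu_{i2}}{\mu_{i+}}\big) & \cdots & -\frac{\mu_{i2}\mu_{iK}}{\mu_{i+}^2} \\
\vdots & & \ddots & \vdots \\
-\frac{\mu_{iK}\mu_{i1}}{\mu_{i+}^2} & -\frac{\mu_{iK}\mu_{i2}}{\mu_{i+}^2} & \cdots & \frac{\mu_{iK}}{\mu_{i+}}\big(1 - \frac{\mu_{iK}}{\mu_{i+}}\big)
\end{pmatrix}. \qquad (4.4.1)$$

Notice that $V_i$ is the covariance of a $K \times 1$ multinomial vector with index
$y_i = X_{i+}$ and cell probabilities $\{\mu_{ij}/\mu_{i+},\ j = 1, \ldots, K\}$.

Let $\hat\theta$ denote the final estimate of $\theta$ obtained upon convergence of the
EM algorithm. Using (4.3.11) and (4.4.1), we can derive an explicit estimate
of the variance-covariance matrix of $\hat\theta$. It is

$$\Big(Z'D(\mu(\hat\theta))Z - Z'\big(\oplus_{i=1}^m V_i(\hat\theta)\big)Z\Big)^{-1}, \qquad (4.4.2)$$

which is the inverse of the information matrix evaluated at $\hat\theta$.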

Numerical Example. We consider the example introduced in section
4.1. The observed data are counts resulting from cross-classifying the 216
respondents with respect to whether they tend toward universalistic (1) or
particularistic (2) values in four different situations (A,B,C,D) of role conflict.
The data are displayed below in Table 4.1.

Table 4.1. Observed cross-classification of 216 respondents with respect to
whether they tend toward universalistic (1) or particularistic (2) values in
four situations (A,B,C,D) of role conflict

                  Observed                        Observed
A   B   C   D    frequency      A   B   C   D    frequency

1   1   1   1       42          2   1   1   1        1
1   1   1   2       23          2   1   1   2        4
1   1   2   1        6          2   1   2   1        1
1   1   2   2       25          2   1   2   2        6
1   2   1   1        6          2   2   1   1        2
1   2   1   2       24          2   2   1   2        9
1   2   2   1        7          2   2   2   1        2
1   2   2   2       38          2   2   2   2       20

We illustrate the results of the previous sections by fitting a simple
loglinear latent class model to the data. The ordinary two-level latent class
model fitted by Goodman is equivalent to the loglinear model

$$\log\mu_{ijklt} = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_l^D + \lambda_t^W + \lambda_{it}^{AW} + \lambda_{jt}^{BW} + \lambda_{kt}^{CW} + \lambda_{lt}^{DW}, \qquad (4.4.3)$$

where i, j, k, l, and t run from 1 to 2.

Using the notation defined above, the set of observable cells is
$C = \{(1,1,1,1), (1,1,1,2), (1,1,2,1), \ldots, (2,2,2,1), (2,2,2,2)\}$ and $m = 2^4 = 16$.
The complete data are $x = (x_{11}, x_{12}, \ldots, x_{16,1}, x_{16,2})'$ where, for instance,
$x_{42} = x_{11222}$ represents the count in cell (1,1,2,2,2). Although we assume
that the complete data means satisfy the model in (4.4.3), we are only able
to observe $y = Lx$ where $L = 1_2' \otimes I_{16}$. Hence, we will fit the model using
the EM algorithm defined in (4.3.7). The FORTRAN program em.loglin was
used to fit the model. The input information needed is

(1)  $\mu^{(0)}$, an initial estimate of the complete data means

(2)  m and n, the lengths of the observed and complete data vectors

(3)  p, the number of independent loglinear parameters

(4)  Z, the design matrix

(5)  L, the $m \times n$ matrix that satisfies Lx = y.

As discussed in section 4.3.1, a simple initial estimate of $\mu$, and hence
of $\beta$, is one that satisfies $L\mu^{(0)} = y$. By simply allocating approximately
a half of each observed cell count to the two levels of the latent factor, we
can find a $\mu^{(0)}$ that satisfies $L\mu^{(0)} = y$. This initial estimate of $\mu$ also allows
us to omit the direct input of the observed data, which can be obtained via
$L\mu^{(0)} = y$.
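
The allocation described above can be sketched as follows, using the counts of Table 4.1. The particular split rule (a share between 0.4 and 0.6 that depends on how many of the four responses are at level 1) is an arbitrary illustrative assumption, not the rule used by em.loglin. The share is varied across cells because splitting every cell in exactly the same proportion would start the latent factor independent of the manifest factors.

```python
# Observed counts from Table 4.1, cells in lexicographical order of (A, B, C, D).
y = [42, 23, 6, 25, 6, 24, 7, 38, 1, 4, 1, 6, 2, 9, 2, 20]


def initial_means(y, tilt=0.05):
    """Split each observed count between the two latent levels.

    `tilt` is a hypothetical tuning constant; cell index bits encode which
    responses are at level 2, so `ones` counts the responses at level 1.
    """
    mu0 = []
    for cell, count in enumerate(y):
        ones = 4 - bin(cell).count("1")
        share = 0.5 + tilt * (ones - 2)          # between 0.4 and 0.6
        mu0.extend([share * count, (1.0 - share) * count])
    return mu0


def collapse_over_latent(mu, K=2):
    """Sum consecutive groups of K entries, i.e. compute L mu."""
    return [sum(mu[i:i + K]) for i in range(0, len(mu), K)]


mu0 = initial_means(y)
recovered = collapse_over_latent(mu0)   # equals y up to rounding
```

Since the observed counts can be recovered from `mu0` by collapsing, only the initial means need to be supplied.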

The two-level latent class model fit the data well ($G^2 = 2.72$, df = 6),
thereby giving us a simple way of interpreting the association among the four
situations of role conflict. Table 4.2 displays the model parameter estimates
and their estimated standard errors. To make model (4.4.3) identifiable, those
parameters not displayed in Table 4.2 were set to zero. The last column,
entitled "Unadj Std Error", contains the standard error estimates that would
be used if the complete data were actually observed. These are too small and
are invalid.

Table 4.2. Parameter and Standard Error Estimates

Parameter             Estimate     Std Error     Unadj Std Error

$\mu$                   0.532        0.491           0.276
$\lambda_2^A$          -0.911        0.197           0.177
$\lambda_2^B$           0.712        0.225           0.171
$\lambda_2^C$           0.604        0.212           0.168
$\lambda_2^D$           1.884        0.334           0.237
$\lambda_2^W$           3.160        0.530           0.317
$\lambda_{22}^{AW}$    -4.032        3.593           1.543
$\lambda_{22}^{BW}$    -3.444        1.151           0.563
$\lambda_{22}^{CW}$    -3.126        0.962           0.518
$\lambda_{22}^{DW}$    -3.081        0.603           0.386

Estimates of certain classification probabilities and their estimated
standard errors were also computed. These probabilities are defined as

$$\begin{aligned}
\pi_{1t}^{\bar{A}W} &= \pi_{1+++t}/\pi_{++++t} = P(A = 1 \mid W = t) \\
\pi_{1t}^{\bar{B}W} &= \pi_{+1++t}/\pi_{++++t} = P(B = 1 \mid W = t) \\
\pi_{1t}^{\bar{C}W} &= \pi_{++1+t}/\pi_{++++t} = P(C = 1 \mid W = t) \\
\pi_{1t}^{\bar{D}W} &= \pi_{+++1t}/\pi_{++++t} = P(D = 1 \mid W = t)
\end{aligned}$$

The standard errors were found using the arguments of section 4.3.3 and the
delta method. For example, the conditional probabilities have form

$$\frac{b_1\pi}{b_2\pi},$$

where $b_1$ and $b_2$ are $1 \times n$ vectors of known constants. Thus, by a direct
application of the delta method, an estimate of the asymptotic variance is

$$\widehat{\mathrm{var}}\Big(\frac{b_1\hat\pi}{b_2\hat\pi}\Big) =
\Big[\frac{(b_2\hat\pi)\,b_1 - (b_1\hat\pi)\,b_2}{(b_2\hat\pi)^2}\Big]\,
\widehat{\mathrm{var}}(\hat\pi)\,
\Big[\frac{(b_2\hat\pi)\,b_1 - (b_1\hat\pi)\,b_2}{(b_2\hat\pi)^2}\Big]', \qquad (4.4.4)$$

where $\widehat{\mathrm{var}}(\hat\pi)$ is the variance of $\hat\pi$ under the multinomial assumption, i.e.
equation (4.3.17). Actually, since the conditional probabilities do not involve
the intercept parameter, the variance of $\hat\pi$ under the Poisson assumption,
which is

$$\frac{1}{N^2}\widehat{\mathrm{var}}(\hat\mu) = \frac{1}{N^2}D(\hat\mu)\,Z\,\widehat{\mathrm{var}}(\hat\alpha, \hat\beta)\,Z'\,D(\hat\mu),$$

could be used in expression (4.4.4) and the result would be the same. This is
not true of the marginal probabilities, which have form $b_1\pi$. An estimate of
the variance of $b_1\hat\pi$ is

$$\widehat{\mathrm{var}}(b_1\hat\pi) = b_1\,\widehat{\mathrm{var}}(\hat\pi)\,b_1',$$

where $\widehat{\mathrm{var}}(\hat\pi)$ is the variance of $\hat\pi$ under the multinomial assumption. The
estimate would be inflated if one used the variance under the Poisson
assumption, reflecting the stochastic nature of the total sample size. To
illustrate, we consider an extreme example. Let $b_1 = 1_n'$ so that $b_1\hat\pi = 1.0$
with probability one. That is, $b_1\hat\pi$ is nonstochastic. If we use the multinomial
variance estimator we get zero as our estimate of the variance, which is what
we know it to be. On the other hand, using the Poisson variance estimator
we get some positive value as our estimate of the variance, which is known
to be incorrect. The estimated probabilities and their estimated standard
deviations are displayed in Table 4.3.


Table 4.3. Classification Probability Estimates (Standard Errors)

Latent
Class t    $\pi_t^W$      $\pi_{1t}^{\bar{A}W}$    $\pi_{1t}^{\bar{B}W}$    $\pi_{1t}^{\bar{C}W}$    $\pi_{1t}^{\bar{D}W}$

1          .279 (.058)    .993 (.025)    .940 (.066)    .927 (.066)    .769 (.095)
2          .721 (.058)    .714 (.040)    .330 (.050)    .354 (.049)    .132 (.038)
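
As a rough consistency check, the point estimates in Table 4.3 can be recomputed from the parameter estimates in Table 4.2. The sketch below rests on two assumptions that are inferred rather than stated in the text: that the rows of Table 4.2 follow the order of the terms in (4.4.3), and that level 1 of every factor is the reference level (all parameters not displayed are zero). Under these assumptions the coded level W = 2 reproduces the row labelled latent class 1 in Table 4.3, and coded level W = 1 reproduces the row labelled 2.

```python
import math

# Estimates read from Table 4.2 (assumed row order: mu, A, B, C, D, W, AW, BW, CW, DW).
mu_hat = 0.532
main = {"A": -0.911, "B": 0.712, "C": 0.604, "D": 1.884}       # level-2 main effects
inter = {"A": -4.032, "B": -3.444, "C": -3.126, "D": -3.081}   # (2,2) interactions with W
lam_w2 = 3.160


def prob_level1(factor, t):
    """P(factor = 1 | W = t) implied by the loglinear model (4.4.3)."""
    logit2 = main[factor] + (inter[factor] if t == 2 else 0.0)
    return 1.0 / (1.0 + math.exp(logit2))


def class_weight(t):
    """Sum over the 16 manifest cells of exp(linear predictor), intercept omitted."""
    w = math.exp(lam_w2) if t == 2 else 1.0
    for f in main:
        w *= 1.0 + math.exp(main[f] + (inter[f] if t == 2 else 0.0))
    return w


p_w2 = class_weight(2) / (class_weight(1) + class_weight(2))
fitted_total = math.exp(mu_hat) * (class_weight(1) + class_weight(2))

for t in (2, 1):
    row = [round(prob_level1(f, t), 3) for f in "ABCD"]
    print(t, row)
```

The recomputed values agree with the tabled estimates to within rounding of the published parameter estimates, and the fitted total is close to the sample size of 216.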


From these estimated classification probabilities, we see that level 1 of
the latent class W can be labeled the 'universalistic' level. That is, subjects
in level 1 of the latent class tend to have universalistic views for all four
situations. Notice that, given a subject is in level 1 of the latent class, the
probability that they respond 'universalistic' is estimated to be at least .77
for each of the four situations. Similarly, one could label level 2 of the latent
class as the 'particularistic' level. Except for situation A, the estimated
probability that an individual in latent level 2 responds 'particularistic' to
the situations is at least .65. Since the latent class model (4.4.3) fits well,
we conclude that, given a person is intrinsically particularistic or intrinsically
universalistic, their responses to the four situations (A, B, C, D) of role conflict
are independent.

4.5     Modified  EM/Newton-Raphson  Algorithm 

In this section we present an alternative root finding algorithm for
the incomplete exponential family score functions of equation (4.2.9). As
mentioned above, the EM algorithm has both positive and negative features.
Two very important positive features are (1) the EM algorithm is insensitive
to starting values and (2) the EM algorithm finds a root that maximizes
the likelihood. In contrast, since the incomplete-data log likelihood is not
generally a concave function of the parameters, the Newton-Raphson (NR)
or Fisher-scoring (FS) algorithms may not converge to a maximal root. In
fact, they will be very sensitive to starting values and may not converge at
all. Negative features of the EM algorithm include its slow convergence and
its failure to provide a precision estimate as a by-product. On the other hand,
the NR and FS algorithms work well locally, in that if we implement these
methods very near a maximal root, the convergence, relative to EM, is fast and
an estimate of precision of the ML estimator is obtained as a by-product.

In practice, the EM algorithm may quickly approach a small neighborhood
around a maximal root, but then slowly converge to the root.
For this reason, we present an alternative algorithm that uses both EM
iterations and NR (or some modified NR, such as FS or quasi-NR) iterations.
Specifically, the EM algorithm will be used initially and then, upon reaching a
neighborhood of the maximal root, the NR type algorithms will be employed.
Meilijson (1989) suggested this approach in a fine exposition of root finding
methods for incomplete data score equations.

Recall that when the complete data have a distribution in the regular
exponential family, the incomplete-data log likelihood $\ell_Y(\beta; y)$ has the form
given in (4.2.8) and the score function has the form

$$S_Y(\beta; y) = \frac{\partial}{\partial\beta}\ell_Y(\beta; y) = E_\beta(T(X) \mid Y = y) - E_\beta(T(X)). \qquad (4.5.1)$$

To solve for a maximal root of (4.5.1) we can begin by using the EM
iterative scheme described in (4.2.11). We will conclude that the iterate
estimate is in a sufficiently small neighborhood of the maximal root as soon as
$\|\beta^{(m)} - \beta^{(m+1)}\| <$ SWITCH(TOL), where SWITCH(TOL) $\geq$ TOL of (4.2.1).
At this point, we will employ the iterative scheme described in (4.2.13). As
a first step in (4.2.13), we must calculate the matrix $A_Y(\beta^{(m)}; y)$, which is an
estimate of the negative Hessian of the incomplete-data log likelihood. At
times the Hessian or expected Hessian can be explicitly calculated. This is
true in the Poisson loglinear case (see equations (4.3.9) and (4.3.10)). Hence
the matrix $A_Y(\beta^{(m)}; y)$ can be explicitly calculated and inverted. Generally,
however, the matrix $A_Y$ will only be an approximation.

Since both $E_\beta(T(X) \mid Y = y)$ and $E_\beta(T(X))$ must be calculated during
the EM algorithm, in view of equation (4.5.1), we must have the ability
to calculate $S_Y(\beta; y)$ at different values of $\beta$. We then could use as an
approximation to $I_Y(\beta^{(m)}; y)$,

$$A_Y(\beta^{(m)}; y)[i,] = \frac{\big(S_Y(\beta^{(m)}; y) - S_Y(\beta^{(m)} + e_i; y)\big)'}{\epsilon}, \qquad i = 1, \ldots, p, \qquad (4.5.2)$$

where the bracket notation $B[i,]$ represents the $i$th row of matrix B and
$e_i = (0, \ldots, 0, \epsilon, 0, \ldots, 0)'$ is a $p \times 1$ vector with a small number $\epsilon$ in the $i$th
position. The value of $\epsilon$ should be determined by rules used for numerical
differentiation. Meilijson (1989) discusses this approximation technique and
refers to it as EM-aided differentiation.
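
The idea of differentiating the score numerically can be sketched in a few lines. The function below uses a forward difference, which is one common rule; the step size `eps` and the quadratic test problem are hypothetical choices made only so that the answer is known in advance.

```python
def neg_hessian_from_score(score, beta, eps=1e-5):
    """Approximate the negative Hessian; row i is (S(beta) - S(beta + e_i)) / eps."""
    base = score(beta)
    rows = []
    for i in range(len(beta)):
        shifted = list(beta)
        shifted[i] += eps                      # beta + e_i
        s = score(shifted)
        rows.append([(b - v) / eps for b, v in zip(base, s)])
    return rows


# Hypothetical check: a quadratic log likelihood whose negative Hessian is H.
H = [[2.0, 0.5], [0.5, 1.0]]
centre = [1.0, -1.0]


def toy_score(beta):
    """Gradient of -(1/2)(beta - centre)' H (beta - centre)."""
    d = [b - c for b, c in zip(beta, centre)]
    return [-sum(H[i][j] * d[j] for j in range(2)) for i in range(2)]


A = neg_hessian_from_score(toy_score, [0.3, 0.2])
```

Only score evaluations are required, which is what makes the technique attractive when the score is already available from the EM computations.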

Evidently, if one uses approximation (4.5.2), the only functions needed
to be calculated for (4.2.13) are the score functions, which are differences
between the conditional and marginal expected values of the sufficient statistic
$T(X)$. Finally, upon convergence of (4.2.13) we can use $[A_Y(\beta^{(\infty)}; y)]^{-1}$ as an
estimate of the precision of the ML estimates $\hat\beta$.

If one feels the EM algorithm will converge quickly enough or that
the matrix inversion of $A_Y$ is unnecessarily burdensome, then one can
select SWITCH(TOL) = TOL, in which case $A_Y$ will be inverted just
once, since the iterative scheme (4.2.13) will converge after one iteration.
For SWITCH(TOL) = TOL, the modified algorithm is simply the EM
algorithm supplemented by a single calculation of a precision estimate. If
SWITCH(TOL) > TOL, then the EM algorithm can be viewed as a procedure
for finding an appropriate starting value for the faster iterative schemes such
as NR or FS.

The modified iterative scheme can be described as follows:

(1)  Solve for $\beta^{(m+1)}$ in $E_{\beta^{(m+1)}}(T(X)) = E_{\beta^{(m)}}(T(X) \mid Y = y)$.

(2)  If $\|\beta^{(m)} - \beta^{(m+1)}\| >$ SWITCH(TOL), then replace $\beta^{(m)}$ by $\beta^{(m+1)}$
     and go to (1). Else go to (3).

(3)  Calculate $[A_Y(\beta^{(m)}; y)]^{-1}$ and $S_Y(\beta^{(m)}; y)$ as discussed above.      (4.5.3)

(4)  Replace $\beta^{(m)}$ by $\beta^{(m+1)} = \beta^{(m)} + [A_Y(\beta^{(m)}; y)]^{-1} S_Y(\beta^{(m)}; y)$.

(5)  If $\|\beta^{(m)} - \beta^{(m+1)}\| >$ TOL, then go to (3) (or (1))*. Else stop.

* If the faster, less stable, algorithms are having trouble converging, reset
SWITCH(TOL) to a smaller value and reuse the EM algorithm to get into a
smaller neighborhood of the maximal root.
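
To make the control flow of the scheme concrete, here is a minimal one-parameter sketch. It does not use the Poisson loglinear model; instead it swaps in the classical genetic-linkage multinomial with counts (125, 18, 20, 34) and cell probabilities (1/2 + t/4, (1-t)/4, (1-t)/4, t/4), a standard toy problem for EM. The function and argument names are hypothetical, and in one dimension the matrix inverse reduces to a division.

```python
def hybrid_em_nr(theta, em_step, score, neg_hessian,
                 switch_tol=1e-2, tol=1e-10, max_iter=500):
    """EM until the step is at most switch_tol, then Newton-type steps until tol."""
    # Steps (1)-(2): EM iterations.
    for _ in range(max_iter):
        new = em_step(theta)
        done = abs(new - theta) <= switch_tol
        theta = new
        if done:
            break
    # Steps (3)-(5): Newton-type iterations using the score and A_Y.
    for _ in range(max_iter):
        new = theta + score(theta) / neg_hessian(theta)
        done = abs(new - theta) <= tol
        theta = new
        if done:
            break
    return theta, 1.0 / neg_hessian(theta)   # estimate and its variance estimate


y1, y2, y3, y4 = 125, 18, 20, 34


def em_step(t):
    """E-step splits y1 into its two latent parts; M-step is a binomial proportion."""
    x2 = y1 * t / (2.0 + t)
    return (x2 + y4) / (x2 + y2 + y3 + y4)


def score(t):
    return y1 / (2.0 + t) - (y2 + y3) / (1.0 - t) + y4 / t


def neg_hessian(t):
    return y1 / (2.0 + t) ** 2 + (y2 + y3) / (1.0 - t) ** 2 + y4 / t ** 2


theta_hat, var_hat = hybrid_em_nr(0.5, em_step, score, neg_hessian)
```

Setting `switch_tol` equal to `tol` reduces the procedure to plain EM followed by a single Newton-type step and a precision estimate, in line with the discussion of SWITCH(TOL) = TOL above.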

Algorithm (4.5.3) should be stable, insensitive to starting values,
relatively fast, and will provide an estimate of the precision of the ML estimate
as a by-product.

As a special case, let us consider applying the modified algorithm (4.5.3)
to the Poisson loglinear model of section 4.3. In that case we were able
to derive an explicit formula for the observed and expected information for
the incomplete data. For simplicity, we will use the expected information
displayed in equation (4.3.10) as our $A_Y$ matrix, i.e.

$$A_Y(\beta; y) = E_\beta(I_Y(\beta; Y)) = Z'D(\mu(\beta))\,L'\,D^{-1}(L\mu(\beta))\,L\,D(\mu(\beta))\,Z. \qquad (4.5.4)$$

By expression (4.3.5), we can write the score function as

$$S_Y(\beta; y) = Z'\Big[\mu \bullet L'\Big(\frac{y - L\mu}{L\mu}\Big)\Big], \qquad (4.5.5)$$

where the '$\bullet$' and the division are componentwise operators.

To start the algorithm, we apply the EM iterative scheme of (4.3.7),
continuing until $\|\beta^{(m)} - \beta^{(m+1)}\| <$ SWITCH(TOL). At this point we will go
to step (3) of (4.5.3) using the formulas (4.5.4) and (4.5.5) for $A_Y$ and $S_Y$.
Repeat steps (3)-(5) of (4.5.3) until the convergence criterion is met.
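
The two ingredients of the Newton-type step for the incomplete Poisson loglinear model can be sketched directly from the matrix expressions. In the sketch below the score is written as $Z'[\mu \bullet L'((y - L\mu)/L\mu)]$, which is the gradient of the incomplete-data Poisson log likelihood; that this matches the exact arrangement of (4.5.5) is an assumption. The design `Z`, the matrix `L`, and the counts `y` are hypothetical.

```python
import math

Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # hypothetical n x p design with intercept
L = [[1, 1, 0, 0], [0, 0, 1, 1]]                       # hypothetical m x n collapsing matrix
y = [7.0, 5.0]                                         # hypothetical observed counts


def means(beta):
    """mu = exp(Z beta)."""
    return [math.exp(sum(z * b for z, b in zip(row, beta))) for row in Z]


def eta_of(mu):
    """eta = L mu."""
    return [sum(l * v for l, v in zip(Li, mu)) for Li in L]


def loglik(beta):
    """Incomplete-data Poisson kernel sum_i (y_i log eta_i - eta_i)."""
    eta = eta_of(means(beta))
    return sum(yi * math.log(e) - e for yi, e in zip(y, eta))


def score(beta):
    """Z'[mu * L'((y - L mu) / (L mu))], with componentwise product and ratio."""
    mu = means(beta)
    eta = eta_of(mu)
    r = [(yi - e) / e for yi, e in zip(y, eta)]
    w = [mu[k] * sum(L[i][k] * r[i] for i in range(len(L))) for k in range(len(mu))]
    return [sum(Z[k][j] * w[k] for k in range(len(mu))) for j in range(len(beta))]


def expected_information(beta):
    """Z' D(mu) L' D^{-1}(L mu) L D(mu) Z, as in (4.5.4)."""
    mu = means(beta)
    eta = eta_of(mu)
    p, m, n = len(beta), len(L), len(mu)
    G = [[sum(L[i][k] * mu[k] * Z[k][j] for k in range(n)) for j in range(p)]
         for i in range(m)]                              # G = L D(mu) Z
    return [[sum(G[i][a] * G[i][b] / eta[i] for i in range(m)) for b in range(p)]
            for a in range(p)]


beta0 = [0.5, -0.1]
S = score(beta0)
A = expected_information(beta0)

# Central-difference gradient of the log likelihood, for comparison with S.
h = 1e-6
numeric = []
for j in range(2):
    up, down = list(beta0), list(beta0)
    up[j] += h
    down[j] -= h
    numeric.append((loglik(up) - loglik(down)) / (2 * h))
```

Given `S` and `A`, step (4) of (4.5.3) amounts to solving the linear system `A d = S` and adding `d` to the current iterate.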

4.6     Discussion 

This  chapter  emphasized  loglinear  model  fitting  when  the  data  are 
incomplete.  As  an  example,  a  latent  class  loglinear  model  was  fit  to  the 
data  presented  in  Goodman  (1974).  The  primary  method  of  obtaining 
ML  estimates  of  the  loglinear  parameters  was  the  EM  algorithm,  but  other 
possibilities  such  as  the  Newton-Raphson  algorithm  were  discussed. 

In  section  4.2  we  reviewed  the  EM  algorithm  with  special  attention 
given  to  the  regular  exponential  family.  For  the  regular  exponential  case,  the 
iterative  scheme  (4.2.11)  was  shown  to  be  equivalent  to  the  EM  algorithm. 
Then,  in  section  4.3.1,  we  derive  the  specific  form  for  the  EM  algorithm 
when  the  data  are  product  Poisson  with  means  following  a  loglinear  model. 
An  explicit  formula  for  the  observed  information  matrix  is  derived  in  section 
4.3.2.  An  estimate  of  the  variance  of  the  ML  estimates  of  latent  class  loglinear 
parameters  is  shown  in  equation  (4.4.2). 

The assumption that the data are product Poisson is not as restrictive
as it may seem. In section 4.3.3 we discuss inference for loglinear parameters
when the complete data are multinomially distributed. The results follow by
arguments of Birch (1963) and Palmgren (1981). It is shown that, when the
total sample size is considered fixed, inferences about all loglinear parameters,
except the one that is fixed by design, are the same for both the product
Poisson assumption and the multinomial assumption. A method of estimating
the variance of classification probability estimates (and functions thereof) is
also developed in this section.

We  introduce  an  alternative  root  finding  algorithm  (4.5.3)  for  the 
incomplete  exponential  family  score  functions  in  section  4.5.  The  algorithm 
exploits  the  positive  features  of  both  the  EM  and  Newton-Raphson  type 
algorithms.  Specifically,  the  algorithm  should  prove  to  be  insensitive  to 
starting  values  and  relatively  fast  (compared  to  straight  EM).  It  also  will 
provide  an  estimate  of  the  precision  of  the  estimators  as  a  by-product. 

As mentioned above, many models that can be fit using the EM
algorithm can also be fit more directly using the Newton-Raphson algorithm.
Appendix B includes a discussion about the program NLIN, which fits
generalized linear and nonlinear models. Also included in the appendix is
the code for the two model fitting programs 'em.loglin' and 'NLIN'. The
FORTRAN program 'em.loglin' is based on the iterative scheme (4.3.6) and
the formula (4.3.9) for the observed information matrix. The Splus program
'NLIN' can be used to fit generalized linear and nonlinear models. The
data are required to be independent and of the exponential dispersion type
(see discussion of NLIN). The author plans on implementing the algorithm
described in (4.5.3) for the Poisson loglinear model case.


APPENDIX  A 
CALCULATIONS  FOR  CHAPTER  2 


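We set out to show that the matrix of equation (2.3.11), viz.

$$
\begin{pmatrix} D(\pi_0) & -H/n_* \\ -H'/n_* & 0 \end{pmatrix}^{-1}
\begin{pmatrix} D(\pi_0) - \oplus\,\pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} D(\pi_0) & -H/n_* \\ -H'/n_* & 0 \end{pmatrix}^{-1},
$$

is equal to the matrix

$$
\begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix},
$$

where

$$
M_1 = D^{-1}(\pi_0) - D^{-1}(\pi_0)H\,(H'D^{-1}(\pi_0)H)^{-1}H'D^{-1}(\pi_0) - \oplus_{1}^{K} 1_R 1_R'
$$

and

$$
M_2 = n_*^2\,(H'D^{-1}(\pi_0)H)^{-1}.
$$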

Proof: For notational convenience, let $D = D(\pi_0)$ and let $H = H(\pi_0)$. We
will state a basic matrix algebra result, the proof of which can be found in
Aitchison and Silvey (1958).

Let $A$ be nonsingular and $B$ be of full column rank. Assuming
compatibility,

$$
\begin{pmatrix} A & -B \\ -B' & 0 \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} - A^{-1}B(B'A^{-1}B)^{-1}B'A^{-1} & -A^{-1}B(B'A^{-1}B)^{-1} \\
-(B'A^{-1}B)^{-1}B'A^{-1} & -(B'A^{-1}B)^{-1}
\end{pmatrix}.
$$

That is, the partitioned matrix has a simple inverse.
Using this result, identifying $D$ and $H/n_*$ with $A$ and $B$, we arrive at
an equivalent form for (2.3.11). It is

$$
\begin{pmatrix}
D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & -n_* D^{-1}H(H'D^{-1}H)^{-1} \\
-n_*(H'D^{-1}H)^{-1}H'D^{-1} & -n_*^2(H'D^{-1}H)^{-1}
\end{pmatrix}
\begin{pmatrix} D - \oplus\,\pi_{0i}\pi_{0i}' & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix}
D^{-1} - D^{-1}H(H'D^{-1}H)^{-1}H'D^{-1} & -n_* D^{-1}H(H'D^{-1}H)^{-1} \\
-n_*(H'D^{-1}H)^{-1}H'D^{-1} & -n_*^2(H'D^{-1}H)^{-1}
\end{pmatrix}.
$$
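The partitioned-inverse result can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `bordered_inverse`, the random test matrices and the seed are illustrative choices, not part of the original programs. In this check the off-diagonal blocks of the inverse of the matrix with blocks $A$, $-B$, $-B'$, $0$ come out as $-A^{-1}B(B'A^{-1}B)^{-1}$ and $-(B'A^{-1}B)^{-1}B'A^{-1}$.

```python
import numpy as np

def bordered_inverse(A, B):
    """Closed-form inverse of [[A, -B], [-B', 0]] (A nonsingular, B full column rank)."""
    A_inv = np.linalg.inv(A)
    S_inv = np.linalg.inv(B.T @ A_inv @ B)          # (B'A^{-1}B)^{-1}
    top_left = A_inv - A_inv @ B @ S_inv @ B.T @ A_inv
    top_right = -A_inv @ B @ S_inv
    bottom_left = -S_inv @ B.T @ A_inv
    return np.block([[top_left, top_right], [bottom_left, -S_inv]])

rng = np.random.default_rng(0)                      # arbitrary seed (assumption)
G = rng.normal(size=(5, 5))
A = G @ G.T + 5.0 * np.eye(5)                       # nonsingular (positive definite)
B = rng.normal(size=(5, 2))                         # full column rank with probability one
M = np.block([[A, -B], [-B.T, np.zeros((2, 2))]])

# The product of the matrix and its claimed inverse should be the identity.
max_err = np.max(np.abs(M @ bordered_inverse(A, B) - np.eye(7)))
print(max_err)
```

Only the signs of the off-diagonal blocks are sensitive here; the blocks $M_1$ and $M_2$ of the final result do not depend on them.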

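Now, using the fact that
$D^{-1}(\pi_0)\big(\oplus\,\pi_{0i}\pi_{0i}'\big)D^{-1}(\pi_0) = \oplus_{1}^{K} 1_R 1_R' = (\oplus_{1}^{K} 1_R)(\oplus_{1}^{K} 1_R')$
and, by Lemma 2.3.1, $(\oplus_{1}^{K} 1_R')H = 0$, we can multiply out these
three partitioned matrices to get

$$
\begin{pmatrix} M_1 & 0 \\ 0 & M_2 \end{pmatrix},
$$

where

$$
M_1 = D^{-1}(\pi_0) - D^{-1}(\pi_0)H\,(H'D^{-1}(\pi_0)H)^{-1}H'D^{-1}(\pi_0) - \oplus_{1}^{K} 1_R 1_R'
$$

and

$$
M_2 = n_*^2\,(H'D^{-1}(\pi_0)H)^{-1}.
$$

This is what we set out to show. ■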

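Result 3 (2.4.6) We wish to show that the asymptotic variances are related
according to

$$
\operatorname{var}(\hat\mu^{(M)}) = \operatorname{var}(\hat\mu^{(P)}) - \oplus\,\frac{\mu_{0i}\mu_{0i}'}{n_i}.
$$

Proof: Since $\mu = e^{\xi}$, we can invoke the delta method to arrive at

$$
\begin{aligned}
\operatorname{var}(\hat\mu^{(M)}) = \operatorname{var}(e^{\hat\xi^{(M)}})
&= D(e^{\xi_0})\operatorname{var}(\hat\xi^{(M)})D(e^{\xi_0}) \\
&= D(e^{\xi_0})\Big(\operatorname{var}(\hat\xi^{(P)}) - \oplus\,\frac{1_R 1_R'}{n_i}\Big)D(e^{\xi_0}) \qquad \text{by 2.4.5} \\
&= D(e^{\xi_0})\operatorname{var}(\hat\xi^{(P)})D(e^{\xi_0}) - \oplus\,\frac{\mu_{0i}\mu_{0i}'}{n_i} \\
&= \operatorname{var}(\hat\mu^{(P)}) - \oplus\,\frac{\mu_{0i}\mu_{0i}'}{n_i},
\end{aligned}
$$

where the equal signs represent asymptotic equivalence. ■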

Result  4  (2.4.7)    We  wish  to  show  that  the  asymptotic  variances  of  the 
freedom  parameter  estimates  are  related  according  to 

$$\operatorname{var}(\hat\beta^{(M)}) = \operatorname{var}(\hat\beta^{(P)}) - \Lambda,$$

where 

/q-*"*!    \  -1/  -It 

A  =  (X'X)-'X'C       f"'  \(®  ^,  ®^)C'X(X'X)-\ 

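Proof: In the following, the equal signs represent asymptotic equivalence.
Now, since $\beta = (X'X)^{-1}X'C\log(A\mu)$, we can invoke the delta method to
arrive at

$$
\begin{aligned}
\operatorname{var}(\hat\beta^{(M)})
&= (X'X)^{-1}X'C\,\operatorname{var}\big(\log(A\hat\mu^{(M)})\big)\,C'X(X'X)^{-1} \\
&= (X'X)^{-1}X'C D^{-1}(A\mu_0)A\,\operatorname{var}(\hat\mu^{(M)})\,A'D^{-1}(A\mu_0)C'X(X'X)^{-1} \\
&= (X'X)^{-1}X'C D^{-1}(A\mu_0)A\,\operatorname{var}(\hat\mu^{(P)})\,A'D^{-1}(A\mu_0)C'X(X'X)^{-1} \\
&\quad - (X'X)^{-1}X'C D^{-1}(A\mu_0)A\Big(\oplus\,\frac{\mu_{0i}\mu_{0i}'}{n_i}\Big)A'D^{-1}(A\mu_0)C'X(X'X)^{-1} \\
&= \operatorname{var}(\hat\beta^{(P)})
 - (X'X)^{-1}X'C D^{-1}(A\mu_0)A\Big(\oplus\,\frac{\mu_{0i}\mu_{0i}'}{n_i}\Big)A'D^{-1}(A\mu_0)C'X(X'X)^{-1}.
\end{aligned}
$$

But by assumption (A1) of section 2.3.1,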


_^-i  (®A,jfiA  (®A,A  (     J^ 


V        V"; 


M^    y,A 


Hence, we have that the asymptotic equivalence

$$\operatorname{var}(\hat\beta^{(M)}) = \operatorname{var}(\hat\beta^{(P)}) - \Lambda$$

holds, where


(a^h 


k  =  {x'XY'^x'c 


which  is  what  we  set  out  to  show. 


fe^,  ®^]c'X{X'x) 


-1 


APPENDIX  B 
CALCULATIONS  FOR  CHAPTER  4 


We  prove  that  the  four  properties  of  the  EM  algorithm  introduced  in 
section  4.2.1  do  indeed  hold.  These  proofs  are  essentially  those  of  Dempster 
et  al.  (1977)  and  Little  and  Rubin  (1986). 

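Property 1. If $\theta^{(m)}$ and $\theta^{(m+1)}$ are the $m$th and $(m+1)$st iterate estimates
obtained via the EM algorithm then

$$\ell_Y(\theta^{(m+1)}; y) \ge \ell_Y(\theta^{(m)}; y);$$

i.e. the log likelihood does not decrease from one iteration to the next.

Proof: As in section 4.2.1, we write the incomplete data log likelihood as

$$\ell_Y(\theta; y) = Q(\theta, \theta^{(m)}; y) - H(\theta, \theta^{(m)}; y).$$

Now, by Jensen's inequality, $H(\theta, \theta^{(m)}; y) \le H(\theta^{(m)}, \theta^{(m)}; y)$, $\forall\theta$. This follows
since

$$
\begin{aligned}
H(\theta^{(m)}, \theta^{(m)}; y) &= E_{\theta^{(m)}}\big(\ell_{X|Y}(\theta^{(m)}; X) \mid Y = y\big) \\
&\ge E_{\theta^{(m)}}\big(\ell_{X|Y}(\theta; X) \mid Y = y\big) = H(\theta, \theta^{(m)}; y)
\end{aligned}
\qquad (B.1)
$$

where the last inequality holds since the 'log' function is concave whereby
Jensen's inequality tells us that

$$
E_{\theta^{(m)}}\Big(\log\frac{f_{X|Y}(X; y, \theta)}{f_{X|Y}(X; y, \theta^{(m)})} \,\Big|\, Y = y\Big)
\le \log E_{\theta^{(m)}}\Big(\frac{f_{X|Y}(X; y, \theta)}{f_{X|Y}(X; y, \theta^{(m)})} \,\Big|\, Y = y\Big) = \log 1 = 0.
$$

Now, equation (B.1) holds at $\theta = \theta^{(m+1)}$, i.e.

$$H(\theta^{(m+1)}, \theta^{(m)}; y) \le H(\theta^{(m)}, \theta^{(m)}; y).$$

Therefore,

$$
\begin{aligned}
\ell_Y(\theta^{(m+1)}; y) &= Q(\theta^{(m+1)}, \theta^{(m)}; y) - H(\theta^{(m+1)}, \theta^{(m)}; y) \\
&\ge Q(\theta^{(m+1)}, \theta^{(m)}; y) - H(\theta^{(m)}, \theta^{(m)}; y) \\
&\ge Q(\theta^{(m)}, \theta^{(m)}; y) - H(\theta^{(m)}, \theta^{(m)}; y) \\
&= \ell_Y(\theta^{(m)}; y)
\end{aligned}
$$

where the second inequality follows since $\theta^{(m+1)}$ is defined to be that value of
$\theta$ that maximizes the function $Q(\theta, \theta^{(m)}; y)$. Hence we have shown that

$$\ell_Y(\theta^{(m+1)}; y) \ge \ell_Y(\theta^{(m)}; y).$$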


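Property 2: The sequence of EM iterates $\{\theta^{(m)}, m \ge 1\}$ satisfies, whenever
$\theta^{(m)}$ converges to $\theta^{(\infty)}$ as $m \to \infty$,

$$
\frac{\partial}{\partial\theta}\ell_Y(\theta; y)\Big|_{\theta^{(\infty)}} = S_Y(\theta^{(\infty)}; y) = 0,
$$

i.e. the estimates converge to a zero of the score vector for $Y$.

Proof: Using Property 3, we can write the score vector for the incomplete
data as

$$
S_Y(\theta^{(m)}; y) = \frac{\partial}{\partial\theta}Q(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m)}}.
$$

But this implies that

$$
\frac{\partial}{\partial\theta}H(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m)}} = 0
$$

since

$$
\frac{\partial}{\partial\theta}\ell_Y(\theta; y)\Big|_{\theta^{(m)}}
= S_Y(\theta^{(m)}; y) - \frac{\partial}{\partial\theta}H(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m)}}.
$$

Therefore,

$$
\begin{aligned}
S_Y(\theta^{(m+1)}; y)
&= \frac{\partial}{\partial\theta}Q(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m+1)}}
 - \frac{\partial}{\partial\theta}H(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m+1)}} \\
&= 0 + o(1;\, m \to \infty)
\end{aligned}
\qquad (B.2)
$$

since, by definition of $\theta^{(m+1)}$,

$$
\frac{\partial}{\partial\theta}Q(\theta, \theta^{(m)}; y)\Big|_{\theta^{(m+1)}} = 0,
$$

and because as $\|\theta^{(m)} - \theta^{(m+1)}\|$ goes to zero the function
$\frac{\partial}{\partial\theta}H(\theta, \theta^{(m)}; y)\big|_{\theta^{(m+1)}}$
goes to zero. But by convergence properties of the EM algorithm
$\|\theta^{(m)} - \theta^{(m+1)}\| \to 0$ as $m \to \infty$. Thus equation (B.2) holds and is tantamount to

$$
\frac{\partial}{\partial\theta}\ell_Y(\theta; y)\Big|_{\theta^{(\infty)}} = 0.
$$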


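Property 3: For any $\theta_0$,

$$
\frac{\partial}{\partial\theta}\big[Q(\theta, \theta_0; y)\big]\Big|_{\theta_0} = S_Y(\theta_0; y) = E_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big).
$$

Proof: Let $R$ denote the set of complete data values $x$ that are consistent
with $Y = y$. Assuming that differentiation and integration can be interchanged,

$$
\begin{aligned}
\frac{\partial}{\partial\theta}Q(\theta, \theta_0; y)\Big|_{\theta_0}
&= E_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big) \\
&= \int_R S_X(\theta_0; x)\, f_{X|Y}(x; y, \theta_0)\, d\nu \\
&= \frac{1}{f_Y(y; \theta_0)}\int_R \frac{\partial}{\partial\theta} f_X(x; \theta)\Big|_{\theta_0} d\nu \\
&= \frac{1}{f_Y(y; \theta_0)}\,\frac{\partial}{\partial\theta}\int_R f_X(x; \theta)\, d\nu\,\Big|_{\theta_0} \\
&= \frac{\partial}{\partial\theta}\log f_Y(y; \theta)\Big|_{\theta_0} = S_Y(\theta_0; y).
\end{aligned}
$$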


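Property 4: For any $\theta_0$,

$$
I_Y(\theta_0; y) = E_{\theta_0}\big(I_X(\theta_0; X) \mid Y = y\big) - \operatorname{var}_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big).
$$

Proof: Since the observed information matrix is the negative Hessian of
the log likelihood, we have that

$$
\begin{aligned}
I_Y(\theta; y) &= -\frac{\partial^2 \ell_Y(\theta; y)}{\partial\theta'\partial\theta} \\
&= -\frac{\partial^2 Q(\theta, \theta_0; y)}{\partial\theta'\partial\theta} + \frac{\partial^2 H(\theta, \theta_0; y)}{\partial\theta'\partial\theta} \\
&= E_{\theta_0}\Big(-\frac{\partial^2}{\partial\theta'\partial\theta}\ell_X(\theta; X) \,\Big|\, Y = y\Big)
 - E_{\theta_0}\Big(-\frac{\partial^2}{\partial\theta'\partial\theta}\ell_{X|Y}(\theta; y, X) \,\Big|\, Y = y\Big) \\
&= E_{\theta_0}\big(I_X(\theta; X) \mid Y = y\big) - E_{\theta_0}\big(I_{X|Y}(\theta; y, X) \mid Y = y\big).
\end{aligned}
$$

But

$$
\begin{aligned}
E_{\theta_0}\big(I_{X|Y}(\theta_0; y, X) \mid Y = y\big)
&= E_{\theta_0}\big([S_X(\theta_0; X) - S_Y(\theta_0; y)][S_X(\theta_0; X) - S_Y(\theta_0; y)]' \mid Y = y\big) \\
&= E_{\theta_0}\big([S_X(\theta_0; X) - E_{\theta_0}(S_X(\theta_0; X) \mid Y = y)] \\
&\qquad\quad \times [S_X(\theta_0; X) - E_{\theta_0}(S_X(\theta_0; X) \mid Y = y)]' \mid Y = y\big) \\
&= E_{\theta_0}\big(S_X(\theta_0; X)S_X'(\theta_0; X) \mid Y = y\big) \\
&\quad - E_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big)E_{\theta_0}\big(S_X'(\theta_0; X) \mid Y = y\big) \\
&= \operatorname{var}_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big).
\end{aligned}
$$

Hence

$$
I_Y(\theta_0; y) = E_{\theta_0}\big(I_X(\theta_0; X) \mid Y = y\big) - \operatorname{var}_{\theta_0}\big(S_X(\theta_0; X) \mid Y = y\big),
$$

which is what we set out to show. ■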

Theorem: If the complete data vector $X$ has distribution in the regular
exponential family, i.e. the density function has form

$$f_X(x; \beta) = a(x)\exp\big(T'(x)\beta - c(\beta)\big) \qquad (B.3)$$

with respect to some measure, then the EM algorithm can be used to find the
MLE of $\beta$ based on incomplete data $Y = Y(X)$ and the algorithm is as stated
in (4.2.11).

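Proof: Sundberg (1976) shows that the EM algorithm can be used to find
the ML estimates of $\beta$ based on incomplete data. We will show that the
general EM algorithm of (4.2.3) reduces to (4.2.11) when the complete data
have distribution in the regular exponential family.
The general EM algorithm (4.2.3) is defined as

$$Q(\beta^{(m+1)}, \beta^{(m)}; y) = \max_{\beta} Q(\beta, \beta^{(m)}; y)$$

where

$$Q(\beta, \beta^{(m)}; y) = E_{\beta^{(m)}}\big(\ell_X(\beta; X) \mid Y = y\big).$$

Now since $X$ has density of form (B.3), it follows that the log likelihood
$\ell_X(\beta; X)$ has form

$$\ell_X(\beta; X) = \log a(X) + T'(X)\beta - c(\beta).$$

Hence,

$$Q(\beta, \beta^{(m)}; y) = E_{\beta^{(m)}}\big(\log a(X) \mid Y = y\big) + \beta' E_{\beta^{(m)}}\big(T(X) \mid Y = y\big) - c(\beta).$$

Now, since

$$\frac{\partial^2 Q(\beta, \beta^{(m)}; y)}{\partial\beta'\partial\beta} = -\frac{\partial^2 c(\beta)}{\partial\beta'\partial\beta} = -\operatorname{var}_{\beta}(T(X))$$

is negative definite, it follows that the solution, say $\beta^{(m+1)}$, to

$$\frac{\partial}{\partial\beta}Q(\beta, \beta^{(m)}; y) = 0$$

is the value of $\beta$ that maximizes the function $Q(\beta, \beta^{(m)}; y)$. But

$$
\begin{aligned}
\frac{\partial}{\partial\beta}Q(\beta, \beta^{(m)}; y) &= E_{\beta^{(m)}}\big(T(X) \mid Y = y\big) - \frac{\partial c(\beta)}{\partial\beta} \\
&= E_{\beta^{(m)}}\big(T(X) \mid Y = y\big) - E_{\beta}\big(T(X)\big).
\end{aligned}
$$

Hence $\beta^{(m+1)}$ satisfies

$$E_{\beta^{(m+1)}}\big(T(X)\big) = E_{\beta^{(m)}}\big(T(X) \mid Y = y\big)$$

which is tantamount to showing the equivalence of the two iterative schemes
(4.2.3) and (4.2.11). ■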

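We differentiate the score vector of equation (4.3.5) to obtain an explicit
expression for the observed information matrix. Recall that we are to show
that the information matrix can be expressed as in (4.3.9), viz.

$$
I_Y(\beta; y) = Z'D(\mu)L'D\Big(\frac{y}{(L\mu)^2}\Big)LD(\mu)Z - Z'D\Big(L'\Big(\frac{y - L\mu}{L\mu}\Big)\Big)D(\mu)Z,
$$

where quotients and squares of vectors are taken elementwise.

Proof: By equation (4.3.5), we know that the score vector for $Y$ is

$$S_Y(\beta; y) = Z'D(\mu)L'\Big(\frac{y - L\mu}{L\mu}\Big).$$

Now

$$
\frac{\partial S_Y(\beta; y)}{\partial\beta'} = \Big(\frac{\partial S_Y(\beta; y)}{\partial\mu'}\Big)\Big(\frac{\partial\mu}{\partial\beta'}\Big),
$$

where

$$\frac{\partial\mu}{\partial\beta'} = \frac{\partial\exp(Z\beta)}{\partial\beta'} = D(\mu)Z.$$

In the following, denote the $n \times 1$ elementary vectors by $e_i$. That is,

$$e_i' = (0, 0, \ldots, 0, 1, 0, \ldots, 0, 0),$$

where the '1' is in the $i$th position.

We set out to find the derivative of the score vector with respect to $\mu$.
It is

$$
\begin{aligned}
\frac{\partial S_Y(\beta; y)}{\partial\mu'} &= \sum_{i=1}^{n}\Big(\frac{\partial S_Y(\beta; y)}{\partial\mu_i}\Big)e_i' \\
&= Z'D\Big(L'\Big(\frac{y - L\mu}{L\mu}\Big)\Big) - Z'D(\mu)L'D\Big(\frac{y}{(L\mu)^2}\Big)L.
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
I_Y(\beta; y) = -\frac{\partial S_Y(\beta; y)}{\partial\beta'}
&= -\Big(\frac{\partial S_Y(\beta; y)}{\partial\mu'}\Big)\Big(\frac{\partial\mu}{\partial\beta'}\Big) \\
&= Z'D(\mu)L'D\Big(\frac{y}{(L\mu)^2}\Big)LD(\mu)Z - Z'D\Big(L'\Big(\frac{y - L\mu}{L\mu}\Big)\Big)D(\mu)Z,
\end{aligned}
$$

which is what we set out to show. ■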
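A closed-form information matrix of this kind can be sanity-checked against a finite-difference derivative of the score. The sketch below is illustrative only and assumes NumPy. It takes the incomplete data $Y = LX$ to be Poisson with mean $L\mu$, $\log\mu = Z\beta$, so that the score is $Z'D(\mu)L'((y - L\mu)/(L\mu))$; the function names and the toy matrices `L`, `Z` and data `y` are hypothetical and are not taken from em.loglin.

```python
import numpy as np

def score_y(beta, Z, L, y):
    """Incomplete-data Poisson score Z'D(mu)L'((y - L mu)/(L mu)), mu = exp(Z beta)."""
    mu = np.exp(Z @ beta)
    lmu = L @ mu
    return Z.T @ (mu * (L.T @ ((y - lmu) / lmu)))

def info_y(beta, Z, L, y):
    """Closed-form observed information, written term by term."""
    mu = np.exp(Z @ beta)
    lmu = L @ mu
    DZ = mu[:, None] * Z                                  # D(mu) Z
    first = DZ.T @ L.T @ np.diag(y / lmu**2) @ L @ DZ
    second = Z.T @ np.diag(L.T @ ((y - lmu) / lmu)) @ DZ
    return first - second

def info_y_numeric(beta, Z, L, y, eps=1e-6):
    """Negative central-difference Jacobian of the score."""
    p = beta.size
    jac = np.zeros((p, p))
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps
        jac[:, j] = (score_y(beta + step, Z, L, y)
                     - score_y(beta - step, Z, L, y)) / (2 * eps)
    return -jac

# Hypothetical toy problem: four complete cells, the first two observed only as a sum.
L = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([7.0, 3.0, 5.0])
beta = np.array([0.5, -0.3])

max_diff = np.max(np.abs(info_y(beta, Z, L, y) - info_y_numeric(beta, Z, L, y)))
print(max_diff)
```

Agreement up to finite-difference error at an arbitrary $\beta$ (not only at the MLE) is what one would expect if the two-term expression is the exact negative derivative of the score.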

Using the delta method we can find the asymptotic variance of $\hat\pi$. The
expression for the asymptotic variance involves the matrix $\partial\pi/\partial\beta'$. We show
that equation (4.3.16) holds, i.e.

$$\frac{\partial\pi}{\partial\beta'} = \big(D(\pi) - \pi\pi'\big)X.$$

Proof: From (4.3.15) we have that

$$\mu = N\,\frac{\exp(X\beta)}{1_n'\exp(X\beta)}.$$

Here $\beta$ is an unconstrained parameter vector of length $p$. Notice that $1_n'\mu = N$,
so that $\pi = \mu/N$, and hence,

$$
\begin{aligned}
\frac{\partial\pi}{\partial\beta'}
&= \frac{\partial}{\partial\beta'}\Big(\frac{\exp(X\beta)}{1_n'\exp(X\beta)}\Big) \\
&= \frac{D(\exp(X\beta))X}{1_n'\exp(X\beta)} - \frac{\exp(X\beta)\exp(X\beta)'X}{(1_n'\exp(X\beta))^2} \\
&= \frac{\big[D(\mu)(1_n'\mu) - \mu\mu'\big]X}{(1_n'\mu)^2} = \big(D(\pi) - \pi\pi'\big)X.
\end{aligned}
$$
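The derivative of the cell probabilities can also be checked numerically. This is a minimal sketch, assuming NumPy and assuming the probabilities take the normalized form $\pi = \exp(X\beta)/(1'\exp(X\beta))$, for which the derivative works out to $(D(\pi) - \pi\pi')X$; the function names, the matrix `X` and the vector `beta` below are hypothetical.

```python
import numpy as np

def probs(beta, X):
    """Cell probabilities pi = exp(X beta) / (1' exp(X beta))."""
    w = np.exp(X @ beta)
    return w / w.sum()

def dprobs_dbeta(beta, X):
    """Analytic Jacobian (D(pi) - pi pi') X."""
    pi = probs(beta, X)
    return (np.diag(pi) - np.outer(pi, pi)) @ X

def dprobs_dbeta_numeric(beta, X, eps=1e-6):
    """Central-difference Jacobian, one column per parameter."""
    p = beta.size
    jac = np.zeros((X.shape[0], p))
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps
        jac[:, j] = (probs(beta + step, X) - probs(beta - step, X)) / (2 * eps)
    return jac

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])   # hypothetical design
beta = np.array([0.4, -0.7])

jac_diff = np.max(np.abs(dprobs_dbeta(beta, X) - dprobs_dbeta_numeric(beta, X)))
col_sums = dprobs_dbeta(beta, X).sum(axis=0)   # probabilities sum to one, so these vanish
print(jac_diff, col_sums)
```

Because the probabilities always sum to one, every column of the Jacobian sums to zero, which gives a second quick check on the formula.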


DESCRIPTION  OF  COMPUTER  PROGRAMS 

em.loglin.  Briefly,  em.loglin  is  a  FORTRAN  program  that  can  be  used  to 
obtain  ML  estimates  of  loglinear  parameters  as  well  as  an  estimate  of  their 
precision  when  only  disjoint  sums  of  the  complete  Poisson  data  are  observable. 
The  EM  algorithm  (4.3.7)  is  used  to  find  the  ML  estimates  and  expression 
(4.3.9) is used to calculate the precision estimate. It is assumed that the
complete data $X$ are distributed product Poisson with $n \times 1$ mean vector
$\mu$ following the loglinear model $\log\mu = Z\beta$. The incomplete data must be
expressible as $Y = LX$ where $L$ is an $m \times n$ matrix that satisfies properties
(1)-(3) of (4.3.1). The user must input the following information:

(1) $\mu^{(0)}$, an initial estimate of the complete data means that satisfies the equation
$L\mu^{(0)} = y$, i.e. $\mu^{(0)}$ is consistent with the observed data $y$

(2) $m$ and $n$, the lengths of the observed and complete data vectors

(3) $p$, the number of loglinear parameters

(4) $Z$, the $n \times p$ full column rank design matrix

(5) $L$, the $m \times n$ matrix that satisfies $Y = LX$.

The output includes

(1) $\hat\beta$, an ML estimate of the loglinear parameter vector $\beta$

(2) $\widehat{\operatorname{var}}(\hat\beta)$, an estimate of precision of the ML estimate

(3) $G^2$, the likelihood ratio goodness-of-fit statistic

(4) df, the degrees of freedom associated with the null asymptotic
Chi-square distribution of $G^2$

(5) $\hat\mu$, an estimate of the complete data cell means

(6) $\widehat{\operatorname{var}}(\hat\mu)$, an estimate of the precision of $\hat\mu$ (Poisson sampling)
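The inputs and outputs listed above can be illustrated with a small sketch. The code below is not the FORTRAN program em.loglin; it is a hypothetical Python and NumPy illustration of one way an EM iteration for this setting might look, under the assumptions that $Y = LX$ is Poisson with mean $L\mu$, that the E-step replaces the complete data by $\hat x = D(\mu)L'(y/(L\mu))$, and that the M-step solves $Z'\mu = Z'\hat x$ by Newton-Raphson. The function names, tolerances and toy data are all illustrative.

```python
import numpy as np

def loglik_y(mu, L, y):
    """Poisson log likelihood of the observed sums, up to a constant."""
    lmu = L @ mu
    return float(y @ np.log(lmu) - lmu.sum())

def m_step(xhat, Z, beta, n_newton=50, tol=1e-12):
    """Solve Z'exp(Z beta) = Z'xhat by Newton-Raphson (complete-data loglinear fit)."""
    for _ in range(n_newton):
        mu = np.exp(Z @ beta)
        step = np.linalg.solve(Z.T @ (mu[:, None] * Z), Z.T @ (xhat - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def em_loglin_sketch(y, L, Z, mu0, max_iter=500, tol=1e-10):
    """EM iterations started from complete-data means mu0 with L mu0 = y."""
    beta = np.linalg.lstsq(Z, np.log(mu0), rcond=None)[0]   # starting value
    history = []
    for _ in range(max_iter):
        mu = np.exp(Z @ beta)
        history.append(loglik_y(mu, L, y))
        xhat = mu * (L.T @ (y / (L @ mu)))                  # E-step: E(X | Y = y)
        beta_new = m_step(xhat, Z, beta)                    # M-step
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, np.exp(Z @ beta), history

# Hypothetical toy problem: cells 1 and 2 are observed only through their sum.
L = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([7.0, 3.0, 5.0])
mu0 = np.array([3.5, 3.5, 3.0, 5.0])                        # satisfies L mu0 = y

beta_hat, mu_hat, history = em_loglin_sketch(y, L, Z, mu0)
print(beta_hat, mu_hat)
```

In this toy problem the likelihood equations can be solved by hand, giving fitted complete-data means of 45/16 and 75/16 for the two levels of the second design column, which makes the sketch easy to verify. The recorded log likelihood values should be non-decreasing, in line with Property 1 of the EM algorithm.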

NLIN. NLIN is an Splus (Becker et al., 1989) program that fits generalized
linear and nonlinear models to data with distributions in the exponential
dispersion family (Jørgensen, 1989). We now briefly describe exponential
dispersion models and how to fit them.

A General Algorithm For Fitting Generalized Linear-Nonlinear Models. Let
$Y_1, \ldots, Y_n$ be independent random variables with exponential dispersion
distributions, i.e. the density function for the random variable $Y_i$ has form

$$f(y_i) = a(y_i, \sigma^2/w_i)\exp\Big\{\frac{w_i}{\sigma^2}\big(\theta_i y_i - \kappa(\theta_i)\big)\Big\},$$

where $\kappa(\theta)$ is the cumulant function and $\kappa'(\theta_i) = \tau(\theta_i) = \mu_i$.

Suppose that each mean can be expressed as an invertible function of
some covariate vector and a $p \times 1$ parameter vector, i.e. $\mu_i = h_i(x_i, \beta)$, $i =
1, \ldots, n$. Some examples are

(1) $\mu_i = x_i'\beta$, Linear Model, Identity Link

(2) $\mu_i = \exp(x_i'\beta)$, Linear Model, Log Link

(3) $\mu_i = \exp(x_i'\beta)/(1 + \exp(x_i'\beta))$, Linear Model, Logit Link

(4) $\mu_i = L_i'\exp(X\beta)$, Nonlinear Model, Log Link

Example (4) is nonlinear when the matrix $L$ is some $m \times n$ ($m < n$) matrix
satisfying (4.3.1) and is not the identity matrix. Note that $L_i'$ is the $i$th row
of the matrix $L$. In fact, the matrix $L$ can be chosen so that the Poisson
loglinear latent class models are a special case of example (4).

Letting the vector $h = (h_1, \ldots, h_n)'$ and the symbol ED represent a
particular exponential dispersion distribution, we say that $\{ED, h\}$ specifies
a generalized linear-nonlinear model. As a special case, suppose that each $h_i$
has a common inverse $g$ such that

$$g(\mu_i) = x_i'\beta.$$

We say that the triple

$$\{ED,\ \eta = g(\mu),\ \eta_i = x_i'\beta\} \qquad (B.4)$$

specifies a generalized linear model (GLM) (McCullagh and Nelder, 1989).
In GLM parlance, the function $g$ is known as the 'link' function. Examples
include

(1) Poisson Loglinear Model:

$\{\text{Poisson}(\mu),\ \eta = \log(\mu),\ \eta_i = x_i'\beta\}$

(2) Binomial Logistic Model:

$\{\text{Binomial}(n, \pi),\ \eta = \log\frac{\pi}{1 - \pi},\ \eta_i = x_i'\beta\}$

(3) Normal Linear Model:

$\{\text{Normal}(\mu, \sigma^2),\ \eta = \mu,\ \eta_i = x_i'\beta\}$


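Maximizing the Likelihood. Our objective is to make inference about the
loglinear parameters in $\beta$ and hence about the means $\mu_i$. We will base our
inference on the maximum likelihood estimates and their precision. Therefore,
we must maximize the log likelihood with respect to $\beta$. The log likelihood for
the sample $Y$ is

$$\ell(\beta; y) = \sum_{i=1}^{n}\log a(y_i, \sigma^2/w_i) + \frac{1}{\sigma^2}\sum_{i=1}^{n} w_i\big(\theta_i y_i - \kappa(\theta_i)\big),$$

where $\kappa'(\theta_i) = \mu_i = h_i(x_i, \beta)$.
The score function is

$$
\begin{aligned}
S(\beta; y) = \frac{\partial\ell}{\partial\beta}
&= \frac{1}{\sigma^2}\sum_{i=1}^{n} w_i\,\frac{\partial\theta_i}{\partial\beta}\big(y_i - \kappa'(\theta_i)\big) \\
&= \frac{1}{\sigma^2}\,\frac{\partial\mu'}{\partial\beta}\,V^{-1}W(y - \mu) \\
&= \frac{1}{\sigma^2}\,D'V^{-1}WS
\end{aligned}
\qquad (B.5)
$$

where

$$V = \oplus_{i=1}^{n} V(\mu_i), \qquad S = y - \mu, \qquad W = \oplus_{i=1}^{n} w_i, \qquad D = \frac{\partial\mu}{\partial\beta'}.$$

Here the matrix $D$ is referred to as the 'model matrix'. The maximum
likelihood estimate may be found by solving for a zero of the score function
(B.5) (at least in many cases). To solve for this zero, we will use a
Newton-Raphson type algorithm which will require calculation of the Hessian matrix.
It is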

d/3'dp    ~  (j2  d(3 
1  g/x' 


^^'  {{V-^D)  ®h)  +  {S'®  I,)^{V-'WD) 


Q^KK'       "^J^^xj   .   V-  ^-pjgi3^ 


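$$
= \frac{1}{\sigma^2}D'\big(-V^{-1}WD + Z\big) = \frac{1}{\sigma^2}\big(-D'V^{-1}WD + D'Z\big),
$$

where $E(Z) = 0$ so that the expected value of the Hessian is

$$E\Big[\frac{\partial^2\ell(\beta; y)}{\partial\beta'\partial\beta}\Big] = -\frac{1}{\sigma^2}D'V^{-1}WD.$$

Therefore, for $\beta^{(k)}$ in a neighborhood of $\hat\beta$, the solution to the score equation,
we have the following linear approximation

$$
\frac{\partial\ell(\beta)}{\partial\beta} \approx \frac{\partial\ell(\beta^{(k)})}{\partial\beta}
+ E\Big[\frac{\partial^2\ell(\beta^{(k)})}{\partial\beta'\partial\beta}\Big](\beta - \beta^{(k)}) \equiv \mathcal{L}(\beta).
$$

The next estimate of $\beta$ will be $\beta^{(k+1)}$, the solution to the linear equation
$\mathcal{L}(\beta^{(k+1)}) = 0$. The solution is

$$
\begin{aligned}
\beta^{(k+1)} &= \beta^{(k)} + (D'WV^{-1}D)^{-1}D'WV^{-1}S \\
&= (D'WV^{-1}D)^{-1}D'WV^{-1}(D\beta^{(k)} + S) \\
&= (D'WV^{-1}D)^{-1}D'WV^{-1}\zeta^{(k)}
\end{aligned}
\qquad (B.6)
$$

where $\zeta^{(k)} = D\beta^{(k)} + S$ is a 'local' dependent variable, and $D$, $V$ and $S$ are
evaluated at $\beta^{(k)}$.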

The iterative algorithm (B.6), which is the Fisher-scoring algorithm, is
also referred to as the iteratively reweighted least squares algorithm (IRLS).
The reason for this label is evident from the last expression in (B.6).
For each $k$, it resembles a weighted least squares estimate, where the weight
matrix is $WV^{-1}$, the model matrix is $D$, and the dependent variable is $\zeta^{(k)}$.

Denoting the ML estimate by $\hat\beta$, we have that in many situations

$$\hat\beta \sim AN\big(\beta,\ \sigma^2(D'WV^{-1}D)^{-1}\big),$$

i.e. $\hat\beta$ has an asymptotic normal distribution. Also, we let $\hat\sigma^2$ denote a
consistent estimator of the dispersion parameter $\sigma^2$. For example, dividing
the deviance statistic by the degrees of freedom associated with its asymptotic
distribution results in a consistent estimator of $\sigma^2$ (Jørgensen, 1989).
By evaluating $D$ and $V$ at $\hat\beta$ and using the consistent estimator $\hat\sigma^2$, we
can consistently estimate the asymptotic variance of $\hat\beta$ by

$$\widehat{\operatorname{var}}(\hat\beta) = \hat\sigma^2(\hat D'W\hat V^{-1}\hat D)^{-1}.$$

The astute reader will notice that, upon specification of the exponential
dispersion distribution, the matrix $V$ is determined. Also, the matrix $W$
is a matrix of known constants. Hence, the only matrix not determined as
yet is $D$, the so-called 'model matrix'. The matrix $D$ is a function of $\beta$ and
$X$ through the following function

$$D = \frac{\partial\mu}{\partial\beta'} = \frac{\partial}{\partial\beta'}h(X, \beta).$$

When the model is of the form (B.4), i.e. the model is a GLM, we have that
the model matrix

$$
D = \frac{\partial\mu}{\partial\beta'} = \Big(\frac{\partial\mu}{\partial\eta'}\Big)\Big(\frac{\partial\eta}{\partial\beta'}\Big)
= \Big(\frac{\partial g(\mu)}{\partial\mu'}\Big)^{-1}X
$$

and can be calculated explicitly. But, more generally, when the model is
$\{ED, h\}$, $D$ cannot be calculated explicitly or at least is very difficult to
calculate explicitly. However, it can be numerically estimated.

Numerical Approximation to D. We use a popular and simple technique to
numerically approximate $D$. Recall that $D$ is the matrix of partial derivatives
of $\mu$ with respect to $\beta$. Hence, the problem is to approximate a derivative
matrix. One such estimate, and the one used in the program NLIN, is

$$D \approx D_N = \big[\mu(\beta + e_1) - \mu(\beta - e_1), \ldots, \mu(\beta + e_p) - \mu(\beta - e_p)\big]E^{-1} \qquad (B.7)$$

where $e_i = (0, \ldots, 0, \epsilon, 0, \ldots, 0)'$ is a $p \times 1$ vector with the small constant $\epsilon$ in
the $i$th position and the matrix $E$ is a $p \times p$ diagonal matrix with $2\epsilon$ on the
diagonal. Thus $E = 2\epsilon I_p = 2[e_1, \ldots, e_p]$.
Now the IRLS algorithm will involve just one additional step and that is
to calculate a numerical approximation to the model matrix $D$. The actual
algorithm used in NLIN is

(1) Input $y$, $w$, $\mu = h(X, \beta)$, $V(\mu)$, and the deviance function $Dev(y, w, \mu)$

(2) Find an initial estimate $\beta^{(0)}$ of $\beta$

(3) Compute $D^{(m)} = D_N(\beta^{(m)})$, $V^{(m)} = V(\beta^{(m)})$, and
$S^{(m)} = y - \mu(\beta^{(m)})$ (B.8)

(4) Compute $\beta^{(m+1)} = (D^{(m)\prime}W(V^{(m)})^{-1}D^{(m)})^{-1}D^{(m)\prime}W(V^{(m)})^{-1}(D^{(m)}\beta^{(m)} + S^{(m)})$, as in (B.6)

(5) Compute $Dev(y, w, \mu^{(m+1)})$

(6) If $|Dev(y, w, \mu^{(m+1)}) - Dev(y, w, \mu^{(m)})| > TOL$, replace $\beta^{(m)}$
by $\beta^{(m+1)}$ and go to (3). Else stop.
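The steps of (B.8) can be sketched in a few lines. The code below is not the Splus program NLIN; it is a hypothetical Python and NumPy illustration in which step (4) is taken to be the Fisher-scoring update (B.6) and the model matrix is replaced by the central-difference approximation (B.7). The function names, the tolerance, and the Poisson log-link example data are assumptions made for the illustration.

```python
import numpy as np

def numerical_model_matrix(mean_fn, beta, eps=1e-6):
    """Central-difference approximation D_N of d mu / d beta', as in (B.7)."""
    cols = []
    for j in range(beta.size):
        e_j = np.zeros(beta.size)
        e_j[j] = eps
        cols.append((mean_fn(beta + e_j) - mean_fn(beta - e_j)) / (2.0 * eps))
    return np.column_stack(cols)

def nlin_sketch(y, w, mean_fn, var_fn, dev_fn, beta0, tol=1e-10, max_iter=100):
    """IRLS with a numerically approximated model matrix."""
    beta = np.asarray(beta0, dtype=float)
    dev = dev_fn(y, w, mean_fn(beta))
    for _ in range(max_iter):
        mu = mean_fn(beta)
        D = numerical_model_matrix(mean_fn, beta)         # step (3)
        wv = w / var_fn(mu)                               # diagonal of W V^{-1}
        zeta = D @ beta + (y - mu)                        # 'local' dependent variable
        beta = np.linalg.solve(D.T @ (wv[:, None] * D), D.T @ (wv * zeta))  # step (4)
        dev_new = dev_fn(y, w, mean_fn(beta))             # step (5)
        done = abs(dev_new - dev) <= tol                  # step (6)
        dev = dev_new
        if done:
            break
    D = numerical_model_matrix(mean_fn, beta)
    wv = w / var_fn(mean_fn(beta))
    return beta, dev, np.linalg.inv(D.T @ (wv[:, None] * D))  # unscaled covariance

def poisson_deviance(y, w, mu):
    """Poisson deviance; assumes all y > 0."""
    return float(2.0 * np.sum(w * (y * np.log(y / mu) - (y - mu))))

# Hypothetical Poisson log-link example.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([2.0, 3.0, 6.0, 7.0, 8.0])
w = np.ones(5)
beta_hat, dev_hat, cov_unscaled = nlin_sketch(
    y, w, lambda b: np.exp(X @ b), lambda mu: mu, poisson_deviance,
    beta0=np.array([np.log(y.mean()), 0.0]))
print(beta_hat, dev_hat)
```

Passing the mean, variance and deviance functions as arguments mirrors step (1), where these are inputs rather than fixed choices. Multiplying the returned matrix by an estimate of the dispersion parameter gives a variance estimate of the form $\hat\sigma^2(\hat D'W\hat V^{-1}\hat D)^{-1}$; for the Poisson example the dispersion is one.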


Notice that step (1) of (B.8) involves inputting the data, the weights,
the mean function, the variance function, and the corresponding deviance
function. It follows that this program can more generally be used to fit models
via quasi-likelihood methods (McCullagh and Nelder, 1989). Another remark
is worth making. When the model is $\{ED, g(\mu) = \mu, \mu = X\beta\}$, i.e.
a Linear, Identity link model, the numerical approximation $D_N$ of $D$ in (B.7),
which equals $X$, is exact. Specifically, for the Normal Linear Model
the approximation is exactly equal to the model matrix $X$. The argument is
as follows:

$$
\begin{aligned}
(D_N)_{ij} &= \big[\mu_i(\beta + e_j) - \mu_i(\beta - e_j)\big]/2\|e_j\| \\
&= \big[x_i'\beta + x_i'e_j - x_i'\beta + x_i'e_j\big]/2\epsilon \\
&= 2x_i'e_j/2\epsilon = x_i'e_j/\epsilon = x_{ij}.
\end{aligned}
$$

Thus, $D_N = X = D$.